Incorrect handling of OOM in WAL replay leading to data loss

Started by Michael Paquier over 2 years ago, 24 messages
#1 Michael Paquier
michael@paquier.xyz
2 attachment(s)

Hi all,

A colleague, Ethan Mertz (in CC), has discovered that we do not
correctly handle WAL records that fail because of an OOM when
allocating their required space. In Ethan's case, we hit the failure
after an allocation failure in XLogReadRecordAlloc():
"out of memory while trying to decode a record of length"

As far as I can see, PerformWalRecovery() uses LOG as elevel for its
private callback in the xlogreader, when going through ReadRecord(),
which leads to a failure being reported, but recovery considers that
the failure is the end of WAL and decides to abruptly end recovery,
leading to data loss.

In crash recovery, any records after the OOM would not be replayed.
At a quick glance, it seems to me that this can also impact standbys,
where recovery could stop earlier than it should once a consistent
point has been reached.
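
To make the failure mode above concrete, here is a minimal sketch of a redo loop that stops on any read failure. The names (ReadResult, replay_until_failure) are hypothetical and do not exist in PostgreSQL; the point is only that an OOM and a genuine end of WAL become indistinguishable, so records after the OOM are never replayed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result of one read attempt; not PostgreSQL's actual types. */
typedef enum ReadResult
{
    READ_RECORD,                /* a valid record was returned */
    READ_END_OF_WAL,            /* no valid record: genuine end of WAL */
    READ_OOM                    /* a valid record exists, allocation failed */
} ReadResult;

/*
 * Replay records until the reader returns anything other than a record.
 * As in the behavior described above, an OOM is not told apart from the
 * end of WAL, so replay just stops.  Returns the number of records
 * replayed.
 */
size_t
replay_until_failure(const ReadResult *results, size_t nresults)
{
    size_t      replayed = 0;

    assert(results != NULL || nresults == 0);

    for (size_t i = 0; i < nresults; i++)
    {
        if (results[i] != READ_RECORD)
            break;              /* OOM and end of WAL look the same here */
        replayed++;
    }
    return replayed;
}
```

With a sequence of two records, an OOM, two more records and then the end of WAL, this loop reports two records replayed and silently drops the other two.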

Attached is a patch that can be applied on HEAD to inject an error,
then just run the script xlogreader_oom.bash attached, or something
similar, to see the failure in the logs:
LOG: redo starts at 0/1913CD0
LOG: out of memory while trying to decode a record of length 57
LOG: redo done at 0/1917358 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

It also looks like recovery_prefetch may mitigate the issue a bit if
we do a read in non-blocking mode, but that's not a strong guarantee
either, especially if the host is under memory pressure.

A patch is registered in the commit fest to improve the error
detection handling, but as far as I can see it fails to handle the OOM
case and changes ReadRecord() to use a WARNING in the redo loop:
/messages/by-id/20200228.160100.2210969269596489579.horikyota.ntt@gmail.com

Off the top of my head, any solution I can think of needs to add more
information to XLogReaderState, where we'd track the type of error
that happened close to errormsg_buf, which is where these errors are
tracked, but none of that can be backpatched, unfortunately.

Comments?
--
Michael

Attachments:

xlogreader_oom.bash (text/plain; charset=us-ascii)
0001-Tweak-to-force-OOM-behavior-when-replaying-records.patch (text/x-diff; charset=us-ascii)
From 36de193b974eed4c45391af34c36f392a0968166 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 1 Aug 2023 11:49:53 +0900
Subject: [PATCH] Tweak to force OOM behavior when replaying records

---
 src/backend/access/transam/xlogreader.c | 27 ++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index c9f9f6e98f..73006f05b8 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -547,6 +547,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	int			readOff;
 	DecodedXLogRecord *decoded;
 	char	   *errormsg;		/* not used */
+	bool		trigger_oom = false;
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -691,7 +692,31 @@ restart:
 	decoded = XLogReadRecordAlloc(state,
 								  total_len,
 								  !nonblocking /* allow_oversized */ );
-	if (decoded == NULL)
+
+#ifndef FRONTEND
+	/*
+	 * Trick to emulate an OOM after a hardcoded number of records
+	 * replayed.
+	 */
+	{
+		struct stat fstat;
+		static int counter = 0;
+
+		if (stat("/tmp/xlogreader_oom", &fstat) == 0)
+		{
+			counter++;
+			if (counter >= 100)
+			{
+				trigger_oom = true;
+
+				/* Reset counter, to not fail when shutting down WAL */
+				counter = 0;
+			}
+		}
+	}
+#endif
+
+	if (decoded == NULL || trigger_oom)
 	{
 		/*
 		 * There is no space in the decode buffer.  The caller should help
-- 
2.40.1

#2 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#1)
Re: Incorrect handling of OOM in WAL replay leading to data loss

At Tue, 1 Aug 2023 12:43:21 +0900, Michael Paquier <michael@paquier.xyz> wrote in

A colleague, Ethan Mertz (in CC), has discovered that we do not
correctly handle WAL records that fail because of an OOM when
allocating their required space. In Ethan's case, we hit the failure
after an allocation failure in XLogReadRecordAlloc():
"out of memory while trying to decode a record of length"

I believe a database server is not supposed to be executed under such
a memory-constrained environment.

In crash recovery, any records after the OOM would not be replayed.
At a quick glance, it seems to me that this can also impact standbys,
where recovery could stop earlier than it should once a consistent
point has been reached.

Actually, the code assumes that an OOM happens solely due to a broken
record length field. I believe that assumption was made
intentionally.

A patch is registered in the commit fest to improve the error
detection handling, but as far as I can see it fails to handle the OOM
case and changes ReadRecord() to use a WARNING in the redo loop:
/messages/by-id/20200228.160100.2210969269596489579.horikyota.ntt@gmail.com

It doesn't change behavior unrelated to the case where the last record
is followed by zeroed trailing bytes.

Off the top of my head, any solution I can think of needs to add more
information to XLogReaderState, where we'd track the type of error
that happened close to errormsg_buf, which is where these errors are
tracked, but none of that can be backpatched, unfortunately.

One issue on changing that behavior is that there's not a simple way
to detect a broken record before loading it into memory. We might be
able to implement a fallback mechanism for example that loads the
record into an already-allocated buffer (which is smaller than the
specified length) just to verify if it's corrupted. However, I
question whether it's worth the additional complexity. And I'm not
sure what would happen if the first allocation failed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3 Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#2)
Re: Incorrect handling of OOM in WAL replay leading to data loss

On Tue, Aug 01, 2023 at 01:51:13PM +0900, Kyotaro Horiguchi wrote:

I believe a database server is not supposed to be executed under such
a memory-constrained environment.

I don't really follow this argument. The backend and the frontends
handle OOM reliably, generating ERRORs or even FATALs depending
on the code path involved. A memory-bound environment is something
that can easily happen if one's not careful enough with the sizing of
the instance. For example, this error can be triggered on a standby
with read-only queries that put pressure on the host's memory.

One issue on changing that behavior is that there's not a simple way
to detect a broken record before loading it into memory. We might be
able to implement a fallback mechanism for example that loads the
record into an already-allocated buffer (which is smaller than the
specified length) just to verify if it's corrupted. However, I
question whether it's worth the additional complexity. And I'm not
sure what would happen if the first allocation failed.

Perhaps we could rely more on a fallback memory, especially if it is
possible to use that for the header validation. That seems like a
separate thing, still.
--
Michael

#4 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#3)
Re: Incorrect handling of OOM in WAL replay leading to data loss

At Tue, 1 Aug 2023 14:03:36 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Tue, Aug 01, 2023 at 01:51:13PM +0900, Kyotaro Horiguchi wrote:

I believe a database server is not supposed to be executed under such
a memory-constrained environment.

I don't really follow this argument. The backend and the frontends
handle OOM reliably, generating ERRORs or even FATALs depending
on the code path involved. A memory-bound environment is something
that can easily happen if one's not careful enough with the sizing of

I didn't mean that OOM should not happen. I was referring to an
environment where allocation failure can happen during crash recovery.
Anyway, I didn't mean that we shouldn't "fix" it.

the instance. For example, this error can be triggered on a standby
with read-only queries that put pressure on the host's memory.

I thought that a failure on a standby results in continuing to retry
reading the next record. However, I found that there's a case where
the startup process stops in response to OOM [1].

One issue on changing that behavior is that there's not a simple way
to detect a broken record before loading it into memory. We might be
able to implement a fallback mechanism for example that loads the
record into an already-allocated buffer (which is smaller than the
specified length) just to verify if it's corrupted. However, I
question whether it's worth the additional complexity. And I'm not
sure what would happen if the first allocation failed.

Perhaps we could rely more on a fallback memory, especially if it is
possible to use that for the header validation. That seems like a
separate thing, still.

Once a record has been read, that size of memory is already
allocated.

While we will not agree, we could establish a default behavior where
an OOM during recovery immediately triggers an ERROR. Then, we could
introduce a *GUC* that causes recovery to regard OOM as an
end-of-recovery error.

regards.

[1]: /messages/by-id/17928-aa92416a70ff44a2@postgresql.org

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#4)
Re: Incorrect handling of OOM in WAL replay leading to data loss

At Tue, 01 Aug 2023 15:28:54 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

While we will not agree, we could establish a default behavior where
an OOM during recovery immediately triggers an ERROR. Then, we could
introduce a *GUC* that causes recovery to regard OOM as an
end-of-recovery error.

If we do that, the reverse might be preferable. (OOMs are
end-of-recovery by default. That can be changed to ERROR by GUC.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#6 Aleksander Alekseev
aleksander@timescale.com
In reply to: Michael Paquier (#1)
Re: Incorrect handling of OOM in WAL replay leading to data loss

Hi,

As far as I can see, PerformWalRecovery() uses LOG as elevel
[...]
Off the top of my head, any solution I can think of needs to add more
information to XLogReaderState, where we'd track the type of error
that happened close to errormsg_buf, which is where these errors are
tracked, but none of that can be backpatched, unfortunately.

Probably I'm missing something, but if memory allocation is required
during WAL replay and it fails, wouldn't it be a better solution to
log the error and terminate the DBMS immediately?

Clearly Postgres doesn't have control of the amount of memory
available. It's up to the DBA to resolve the problem and start the
recovery again. If this happens on a replica, it indicates a
misconfiguration of the system and/or lack of the corresponding
configuration options.

Maybe a certain amount of memory should be reserved for the WAL replay
and perhaps other needs. In that case the system should account
for the overcommitment of the OS - cases when a successful malloc()
doesn't necessarily allocate the required amount of *physical* memory,
as it's done on Linux.

--
Best regards,
Aleksander Alekseev

#7 Jeff Davis
pgsql@j-davis.com
In reply to: Aleksander Alekseev (#6)
Re: Incorrect handling of OOM in WAL replay leading to data loss

On Tue, 2023-08-01 at 16:14 +0300, Aleksander Alekseev wrote:

Probably I'm missing something, but if memory allocation is required
during WAL replay and it fails, wouldn't it be a better solution to
log the error and terminate the DBMS immediately?

We need to differentiate between:

1. No valid record exists and it must be the end of WAL; LOG and start
up.

2. A valid record exists and we are unable to process it (e.g. due to
OOM); PANIC.

Regards,
Jeff Davis

#8 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#4)
1 attachment(s)
Re: Incorrect handling of OOM in WAL replay leading to data loss

At Tue, 01 Aug 2023 15:28:54 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

I thought that a failure on a standby results in continuing to retry
reading the next record. However, I found that there's a case where
the startup process stops in response to OOM [1].

I've examined the calls to
MemoryContextAllocExtended(..,MCXT_ALLOC_NO_OOM). In the server
recovery path, XLogDecodeNextRecord is the only function that uses it.

So, there doesn't seem to be a problem here. I proceeded to test the
idea of only verifying headers after an allocation failure, and I've
attached a PoC.

- allocate_recordbuf() ensures a minimum of SizeOfXLogRecord bytes
when it returns false, indicating an allocation failure.

- If allocate_recordbuf() returns false, XLogDecodeNextRecord()
continues to read pages and perform header checks until total_len
is reached, but without copying data (except for the logical
record header, when the first page didn't store the entire header).

- If all relevant WAL pages are consistent, ReadRecord concludes with
an 'out of memory' ERROR, which then escalates to FATAL.

I believe this approach is sufficient to determine whether the error
is OOM or not. If total_len is corrupted and has an excessively large
value, it's highly unlikely that all subsequent pages for that length
will be consistent.
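
As a rough illustration of the fallback idea described above, the sketch below keeps a statically allocated, header-sized buffer and switches to a verify-only mode when the full allocation fails. All names here (RecordBuf, prepare_record_buf, HEADER_SIZE and so on) are hypothetical, HEADER_SIZE is an assumed stand-in for SizeOfXLogRecord, and the allocator is passed in only so that a failure can be simulated; this is not the code of the attached PoC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed stand-in for SizeOfXLogRecord; the real value is platform-defined. */
#define HEADER_SIZE 24

/* Allocator signature, so that tests can inject a failing allocator. */
typedef void *(*alloc_fn) (size_t);

typedef struct RecordBuf
{
    char       *buf;            /* where record bytes are assembled */
    size_t      size;           /* usable size of buf */
    bool        verify_only;    /* true: check headers, do not copy data */
} RecordBuf;

/* Always available, large enough for the record header only. */
static char fallback_buf[HEADER_SIZE];

/*
 * Try to get a buffer for a record of total_len bytes.  On allocation
 * failure, fall back to the static header-sized buffer and enable
 * verify-only mode.  Returns true if the full buffer was allocated.
 */
bool
prepare_record_buf(RecordBuf *rb, size_t total_len, alloc_fn alloc)
{
    assert(rb != NULL && total_len >= HEADER_SIZE);

    rb->buf = alloc(total_len);
    if (rb->buf == NULL)
    {
        rb->buf = fallback_buf;
        rb->size = HEADER_SIZE;
        rb->verify_only = true;
        return false;
    }
    rb->size = total_len;
    rb->verify_only = false;
    return true;
}

/* Free the buffer unless it is the static fallback. */
void
release_record_buf(RecordBuf *rb)
{
    if (rb->buf != fallback_buf)
        free(rb->buf);
    rb->buf = NULL;
    rb->size = 0;
}

/* Test helper that simulates an out-of-memory condition. */
void *
always_fail_alloc(size_t size)
{
    (void) size;
    return NULL;
}
```

A caller in verify-only mode would still walk every page of the record and validate page headers, then report the OOM as a hard failure only if all of them were consistent.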

Do you have any thoughts on this?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

PoC_continue_record_verification_after_OOM.txt (text/plain; charset=us-ascii)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index c9f9f6e98f..a6c57d50bf 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -150,6 +150,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		return NULL;
 	}
 	state->errormsg_buf[0] = '\0';
+	state->errormsg_fatal =false;
 
 	/*
 	 * Allocate an initial readRecordBuf of minimal size, which can later be
@@ -196,6 +197,7 @@ XLogReaderFree(XLogReaderState *state)
 static bool
 allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 {
+	static char fallback_buf[SizeOfXLogRecord];
 	uint32		newSize = reclength;
 
 	newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
@@ -224,9 +226,12 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 		pfree(state->readRecordBuf);
 	state->readRecordBuf =
 		(char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
+
 	if (state->readRecordBuf == NULL)
 	{
-		state->readRecordBufSize = 0;
+		/* failed to allocate buffer. use the fallback buffer instead */
+		state->readRecordBufSize = SizeOfXLogRecord;
+		state->readRecordBuf = fallback_buf;
 		return false;
 	}
 	state->readRecordBufSize = newSize;
@@ -547,6 +552,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	int			readOff;
 	DecodedXLogRecord *decoded;
 	char	   *errormsg;		/* not used */
+	bool		verify_only;
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -589,6 +595,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	}
 
 restart:
+	verify_only = false;
 	state->nonblocking = nonblocking;
 	state->currRecPtr = RecPtr;
 	assembled = false;
@@ -702,8 +709,12 @@ restart:
 
 		/* We failed to allocate memory for an oversized record. */
 		report_invalid_record(state,
-							  "out of memory while trying to decode a record of length %u", total_len);
-		goto err;
+							  "out of memory while trying to decode a record of length %u at %X/%X",
+							  total_len, LSN_FORMAT_ARGS(RecPtr));
+		if (gotheader)
+			state->errormsg_fatal = true;
+					
+		verify_only = true;
 	}
 
 	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
@@ -719,19 +730,19 @@ restart:
 
 		/*
 		 * Enlarge readRecordBuf as needed.
+		 * If failed, continue only performing verification.
 		 */
-		if (total_len > state->readRecordBufSize &&
+		if (!verify_only &&
+			total_len > state->readRecordBufSize &&
 			!allocate_recordbuf(state, total_len))
-		{
-			/* We treat this as a "bogus data" condition */
-			report_invalid_record(state, "record length %u at %X/%X too long",
-								  total_len, LSN_FORMAT_ARGS(RecPtr));
-			goto err;
-		}
+			verify_only = true;
 
 		/* Copy the first fragment of the record from the first page. */
-		memcpy(state->readRecordBuf,
-			   state->readBuf + RecPtr % XLOG_BLCKSZ, len);
+		Assert(!verify_only || gotheader || len < SizeOfXLogRecord);
+		if (!verify_only || !gotheader)
+			memcpy(state->readRecordBuf,
+				   state->readBuf + RecPtr % XLOG_BLCKSZ, len);
+			
 		buffer = state->readRecordBuf + len;
 		gotlen = len;
 
@@ -811,13 +822,23 @@ restart:
 				readOff = ReadPageInternal(state, targetPagePtr,
 										   pageHeaderSize + len);
 
-			memcpy(buffer, (char *) contdata, len);
+			if (!verify_only)
+				memcpy(buffer, (char *) contdata, len);
+				
 			buffer += len;
 			gotlen += len;
 
 			/* If we just reassembled the record header, validate it. */
 			if (!gotheader)
 			{
+				if (verify_only)
+				{
+					Assert(gotlen - len < SizeOfXLogRecord &&
+						   gotlen + len >= SizeOfXLogRecord);
+					memcpy(buffer, (char *) contdata,
+						   SizeOfXLogRecord - (gotlen - len));
+				}
+					
 				record = (XLogRecord *) state->readRecordBuf;
 				if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr,
 										   record, randAccess))
@@ -828,6 +849,15 @@ restart:
 
 		Assert(gotheader);
 
+		if (verify_only)
+		{
+			report_invalid_record(state,
+								  "out of memory while trying to decode a record of length %u at %X/%X",
+								  total_len, LSN_FORMAT_ARGS(RecPtr));
+			state->errormsg_fatal = true;
+			goto err;
+		}
+
 		record = (XLogRecord *) state->readRecordBuf;
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..f37d85254f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3099,8 +3099,16 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * shouldn't loop anymore in that case.
 			 */
 			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+			{
+				int tmp_emode =
+					emode_for_corrupt_record(emode, xlogreader->EndRecPtr);
+
+				if (xlogreader->errormsg_fatal)
+					tmp_emode = ERROR;
+				ereport(tmp_emode,
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			}
+			
 		}
 
 		/*
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index da32c7db77..a2fca907e4 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -310,7 +310,8 @@ struct XLogReaderState
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
 	bool		errormsg_deferred;
-
+	bool		errormsg_fatal;
+	
 	/*
 	 * Flag to indicate to XLogPageReadCB that it should not block waiting for
 	 * data.
#9 Michael Paquier
michael@paquier.xyz
In reply to: Jeff Davis (#7)
Re: Incorrect handling of OOM in WAL replay leading to data loss

On Tue, Aug 01, 2023 at 04:39:54PM -0700, Jeff Davis wrote:

On Tue, 2023-08-01 at 16:14 +0300, Aleksander Alekseev wrote:

Probably I'm missing something, but if memory allocation is required
during WAL replay and it fails, wouldn't it be a better solution to
log the error and terminate the DBMS immediately?

We need to differentiate between:

1. No valid record exists and it must be the end of WAL; LOG and start
up.

2. A valid record exists and we are unable to process it (e.g. due to
OOM); PANIC.

Yes, still there is a bit more to it. The introduction of
palloc(MCXT_ALLOC_NO_OOM) partially originates from this thread, which
reported a problem where we switched from malloc() to palloc()
when xlogreader.c got introduced:
/messages/by-id/CAHGQGwE46cJC4rJGv+kVMV8g6BxHm9dBR_7_QdPjvJUqdt7m=Q@mail.gmail.com

And the malloc() behavior when replaying WAL records is even older
than that.

In the end, we want to be able to give more options to anybody looking
at WAL records, and let them make decisions based on the error reached
and the state of the system. For example, it does not make much sense
to fail hard on OOM when replaying records in standby mode because
we can just loop again. The same can actually be said when in crash
recovery. On OOM, the startup process considers that we have an
invalid record now, which is incorrect. We could fail hard and FATAL
to replay again (sounds like the natural option), or we could loop
over the record that failed its allocation, repeating things. In any
case, we need to give more information back to the system so that it
can make better decisions on what it should do.
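
One possible shape for such a decision, as a sketch only: the enum and function names below are hypothetical, and the policy shown (retry in standby mode, fail hard in crash recovery) is just one of the options mentioned above, not a settled design.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reason why no record was returned by the reader. */
typedef enum ReadError
{
    READ_ERR_NONE,              /* a record was returned */
    READ_ERR_OOM,               /* allocation failed for a record */
    READ_ERR_INVALID_DATA       /* record or page failed validation */
} ReadError;

/* Hypothetical next step for the startup process. */
typedef enum NextStep
{
    STEP_REPLAY,                /* apply the record */
    STEP_RETRY_SAME_RECORD,     /* loop and try the same record again */
    STEP_FAIL_HARD,             /* stop with FATAL, replay again later */
    STEP_INVALID_RECORD         /* existing invalid-record handling */
} NextStep;

/*
 * Choose what to do based on the error and the recovery mode, instead of
 * treating every failure as an invalid record.
 */
NextStep
choose_next_step(ReadError err, bool standby_mode)
{
    switch (err)
    {
        case READ_ERR_NONE:
            return STEP_REPLAY;
        case READ_ERR_OOM:
            return standby_mode ? STEP_RETRY_SAME_RECORD : STEP_FAIL_HARD;
        case READ_ERR_INVALID_DATA:
            return STEP_INVALID_RECORD;
    }

    /* Unknown error values are unexpected; be conservative. */
    assert(false);
    return STEP_FAIL_HARD;
}
```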
--
Michael

#10 Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#8)
3 attachment(s)
Re: Incorrect handling of OOM in WAL replay leading to data loss

On Wed, Aug 02, 2023 at 01:16:02PM +0900, Kyotaro Horiguchi wrote:

I believe this approach is sufficient to determine whether the error
is OOM or not. If total_len is corrupted and has an excessively large
value, it's highly unlikely that all subsequent pages for that length
will be consistent.

Do you have any thoughts on this?

This could be more flexible IMO, and actually in some cases
errormsg_fatal may be eaten if using the WAL prefetcher as the error
message is reported with the caller of XLogPrefetcherReadRecord(), no?

Anything that has been discussed on this thread now involves a change
in XLogReaderState that induces an ABI breakage. For HEAD, we are
likely going in this direction, but if we are going to bite the bullet
we'd better be a bit more aggressive with the integration and report
an error code side-by-side with the error message returned by
XLogPrefetcherReadRecord(), XLogReadRecord() and XLogNextRecord() so
that all of the callers can decide what they want to do on an invalid
record or just an OOM.

Attached is the idea of infrastructure I have in mind, as of 0001,
where this adds an error code to report_invalid_record(). For now
this includes three error codes appended to the error messages
generated that can be expanded if need be: no error, OOM and invalid
data. The invalid data part may need to be much more verbose, and
could be improved to make this stuff "less scary" as the other thread
proposes, but what I have here would be enough to trigger a different
decision in the startup process if a record cannot be fetched on OOM
or if there's a different reason behind that.
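
The general pattern, reduced to a standalone sketch: the error code is stored next to the deferred error message when the error is reported, and both are handed to the caller together and then reset. The Reader type and the reader_report()/reader_consume_error() functions are hypothetical simplifications for illustration; they are not the functions changed by 0001.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical error codes, mirroring the three categories listed above. */
typedef enum ReaderError
{
    READER_NONE = 0,
    READER_OOM,                 /* out of memory */
    READER_INVALID_DATA         /* record data failed validation */
} ReaderError;

/* Simplified reader state: message buffer plus the code that goes with it. */
typedef struct Reader
{
    char        errormsg_buf[256];
    bool        errormsg_deferred;
    ReaderError errorcode;
} Reader;

/* Record a formatted error message together with its error code. */
void
reader_report(Reader *r, ReaderError code, const char *fmt, ...)
{
    va_list     args;

    assert(r != NULL && code != READER_NONE);

    va_start(args, fmt);
    vsnprintf(r->errormsg_buf, sizeof(r->errormsg_buf), fmt, args);
    va_end(args);

    r->errormsg_deferred = true;
    r->errorcode = code;
}

/*
 * Hand the pending error, if any, to the caller and clear it.  Returns
 * false and leaves the outputs untouched when no error is pending.
 */
bool
reader_consume_error(Reader *r, const char **errormsg, ReaderError *errorcode)
{
    if (!r->errormsg_deferred)
        return false;

    *errormsg = r->errormsg_buf;
    *errorcode = r->errorcode;
    r->errormsg_deferred = false;
    r->errorcode = READER_NONE;
    return true;
}
```

A caller can then branch on the returned code, for example escalating only on READER_OOM, without having to parse the message text.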

0002 is an example of decision that can be taken in WAL replay if we
see an OOM, based on the error code received. One argument is that we
may want to improve emode_for_corrupt_record() so that it reacts better
on OOM, upgrading the emode wanted, but this needs more analysis
depending on the code path involved.

0003 is my previous trick to inject an OOM failure at replay. Reusing
the previous script, this would be enough to prevent redo from ending
early and losing data.

Note that we have a few other things going in the tree. As one
example, pg_walinspect would consider an OOM as the end of WAL. Not
critical, still slightly incorrect as the end of WAL may not have been
reached yet so it can report some incorrect information depending on
what the WAL reader faces. This could be improved with the additions
of 0001.

Thoughts or comments?
--
Michael

Attachments:

v1-0001-Add-infrastructure-to-report-error-codes-in-WAL-r.patch (text/x-diff; charset=us-ascii)
From acc80d3a6199dd7b28848579cb723e8af837a198 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 8 Aug 2023 16:13:56 +0900
Subject: [PATCH v1 1/3] Add infrastructure to report error codes in WAL reader

This adds a field named errorcode to XLogReaderState, while the APIs in
charge of reading the next WAL records report an error code in parallel
of the error read.
---
 src/include/access/xlogprefetcher.h           |  3 +-
 src/include/access/xlogreader.h               | 16 +++-
 src/backend/access/transam/twophase.c         |  3 +-
 src/backend/access/transam/xlogprefetcher.c   |  5 +-
 src/backend/access/transam/xlogreader.c       | 86 +++++++++++++++----
 src/backend/access/transam/xlogrecovery.c     |  3 +-
 src/backend/replication/logical/logical.c     |  3 +-
 .../replication/logical/logicalfuncs.c        |  3 +-
 src/backend/replication/slotfuncs.c           |  3 +-
 src/backend/replication/walsender.c           |  3 +-
 src/bin/pg_rewind/parsexlog.c                 |  9 +-
 src/bin/pg_waldump/pg_waldump.c               |  3 +-
 contrib/pg_walinspect/pg_walinspect.c         |  3 +-
 13 files changed, 112 insertions(+), 31 deletions(-)

diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
index 7dd7f20ad0..7f80ed922f 100644
--- a/src/include/access/xlogprefetcher.h
+++ b/src/include/access/xlogprefetcher.h
@@ -48,7 +48,8 @@ extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
 									XLogRecPtr recPtr);
 
 extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
-											char **errmsg);
+											char **errmsg,
+											XLogReaderError *errorcode);
 
 extern void XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index da32c7db77..24554de10f 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -58,6 +58,14 @@ typedef struct WALSegmentContext
 
 typedef struct XLogReaderState XLogReaderState;
 
+/* Values for XLogReaderState.reason */
+typedef enum XLogReaderError
+{
+	XLOG_READER_NONE = 0,
+	XLOG_READER_OOM,			/* out-of-memory */
+	XLOG_READER_INVALID_DATA,	/* record data */
+} XLogReaderError;
+
 /* Function type definitions for various xlogreader interactions */
 typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
@@ -310,6 +318,8 @@ struct XLogReaderState
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
 	bool		errormsg_deferred;
+	/* Error code when filling errormsg_buf */
+	XLogReaderError	errorcode;
 
 	/*
 	 * Flag to indicate to XLogPageReadCB that it should not block waiting for
@@ -355,11 +365,13 @@ typedef enum XLogPageReadResult
 
 /* Read the next XLog record. Returns NULL on end-of-WAL or failure */
 extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
-										 char **errormsg);
+										 char **errormsg,
+										 XLogReaderError *errorcode);
 
 /* Consume the next record or error. */
 extern DecodedXLogRecord *XLogNextRecord(XLogReaderState *state,
-										 char **errormsg);
+										 char **errormsg,
+										 XLogReaderError *errorcode);
 
 /* Release the previously returned record, if necessary. */
 extern XLogRecPtr XLogReleasePreviousRecord(XLogReaderState *state);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index c6af8cfd7e..79ed829e7a 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1400,6 +1400,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
+	XLogReaderError errorcode;
 
 	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
 									XL_ROUTINE(.page_read = &read_local_xlog_page,
@@ -1413,7 +1414,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 
 	XLogBeginRead(xlogreader, lsn);
-	record = XLogReadRecord(xlogreader, &errormsg);
+	record = XLogReadRecord(xlogreader, &errormsg, &errorcode);
 
 	if (record == NULL)
 	{
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 539928cb85..87ed7aa7b1 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -984,7 +984,8 @@ XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher, XLogRecPtr recPtr)
  * tries to initiate I/O for blocks referenced in future WAL records.
  */
 XLogRecord *
-XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
+XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg,
+						 XLogReaderError *errcode)
 {
 	DecodedXLogRecord *record;
 	XLogRecPtr	replayed_up_to;
@@ -1052,7 +1053,7 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	}
 
 	/* Read the next record. */
-	record = XLogNextRecord(prefetcher->reader, errmsg);
+	record = XLogNextRecord(prefetcher->reader, errmsg, errcode);
 	if (!record)
 		return NULL;
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index c9f9f6e98f..8fbf2d3513 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -41,8 +41,10 @@
 #include "common/logging.h"
 #endif
 
-static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
-			pg_attribute_printf(2, 3);
+static void report_invalid_record(XLogReaderState *state,
+								  XLogReaderError errorcode,
+								  const char *fmt,...)
+			pg_attribute_printf(3, 4);
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
@@ -70,7 +72,8 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
  * the current record being read.
  */
 static void
-report_invalid_record(XLogReaderState *state, const char *fmt,...)
+report_invalid_record(XLogReaderState *state, XLogReaderError errorcode,
+					  const char *fmt,...)
 {
 	va_list		args;
 
@@ -81,6 +84,7 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...)
 	va_end(args);
 
 	state->errormsg_deferred = true;
+	state->errorcode = errorcode;
 }
 
 /*
@@ -355,7 +359,8 @@ XLogReleasePreviousRecord(XLogReaderState *state)
  * valid until the next call to XLogNextRecord.
  */
 DecodedXLogRecord *
-XLogNextRecord(XLogReaderState *state, char **errormsg)
+XLogNextRecord(XLogReaderState *state, char **errormsg,
+			   XLogReaderError *errorcode)
 {
 	/* Release the last record returned by XLogNextRecord(). */
 	XLogReleasePreviousRecord(state);
@@ -367,7 +372,10 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
 		{
 			if (state->errormsg_buf[0] != '\0')
 				*errormsg = state->errormsg_buf;
+			if (state->errorcode != XLOG_READER_NONE)
+				*errorcode =  state->errorcode;
 			state->errormsg_deferred = false;
+			state->errorcode = XLOG_READER_NONE;
 		}
 
 		/*
@@ -419,7 +427,8 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
  * valid until the next call to XLogReadRecord.
  */
 XLogRecord *
-XLogReadRecord(XLogReaderState *state, char **errormsg)
+XLogReadRecord(XLogReaderState *state, char **errormsg,
+			   XLogReaderError *errorcode)
 {
 	DecodedXLogRecord *decoded;
 
@@ -437,7 +446,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		XLogReadAhead(state, false /* nonblocking */ );
 
 	/* Consume the head record or error. */
-	decoded = XLogNextRecord(state, errormsg);
+	decoded = XLogNextRecord(state, errormsg, errorcode);
 	if (decoded)
 	{
 		/*
@@ -623,7 +632,9 @@ restart:
 	}
 	else if (targetRecOff < pageHeaderSize)
 	{
-		report_invalid_record(state, "invalid record offset at %X/%X: expected at least %u, got %u",
+		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
+							  "invalid record offset at %X/%X: expected at least %u, got %u",
 							  LSN_FORMAT_ARGS(RecPtr),
 							  pageHeaderSize, targetRecOff);
 		goto err;
@@ -632,7 +643,9 @@ restart:
 	if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
 		targetRecOff == pageHeaderSize)
 	{
-		report_invalid_record(state, "contrecord is requested by %X/%X",
+		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
+							  "contrecord is requested by %X/%X",
 							  LSN_FORMAT_ARGS(RecPtr));
 		goto err;
 	}
@@ -673,6 +686,7 @@ restart:
 		if (total_len < SizeOfXLogRecord)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "invalid record length at %X/%X: expected at least %u, got %u",
 								  LSN_FORMAT_ARGS(RecPtr),
 								  (uint32) SizeOfXLogRecord, total_len);
@@ -702,6 +716,7 @@ restart:
 
 		/* We failed to allocate memory for an oversized record. */
 		report_invalid_record(state,
+							  XLOG_READER_OOM,
 							  "out of memory while trying to decode a record of length %u", total_len);
 		goto err;
 	}
@@ -724,7 +739,9 @@ restart:
 			!allocate_recordbuf(state, total_len))
 		{
 			/* We treat this as a "bogus data" condition */
-			report_invalid_record(state, "record length %u at %X/%X too long",
+			report_invalid_record(state,
+								  XLOG_READER_OOM,
+								  "record length %u at %X/%X too long",
 								  total_len, LSN_FORMAT_ARGS(RecPtr));
 			goto err;
 		}
@@ -773,6 +790,7 @@ restart:
 			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "there is no contrecord flag at %X/%X",
 									  LSN_FORMAT_ARGS(RecPtr));
 				goto err;
@@ -786,6 +804,7 @@ restart:
 				total_len != (pageHeader->xlp_rem_len + gotlen))
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "invalid contrecord length %u (expected %lld) at %X/%X",
 									  pageHeader->xlp_rem_len,
 									  ((long long) total_len) - gotlen,
@@ -1116,6 +1135,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid record length at %X/%X: expected at least %u, got %u",
 							  LSN_FORMAT_ARGS(RecPtr),
 							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
@@ -1124,6 +1144,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid resource manager ID %u at %X/%X",
 							  record->xl_rmid, LSN_FORMAT_ARGS(RecPtr));
 		return false;
@@ -1137,6 +1158,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 		if (!(record->xl_prev < RecPtr))
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "record with incorrect prev-link %X/%X at %X/%X",
 								  LSN_FORMAT_ARGS(record->xl_prev),
 								  LSN_FORMAT_ARGS(RecPtr));
@@ -1153,6 +1175,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 		if (record->xl_prev != PrevRecPtr)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "record with incorrect prev-link %X/%X at %X/%X",
 								  LSN_FORMAT_ARGS(record->xl_prev),
 								  LSN_FORMAT_ARGS(RecPtr));
@@ -1189,6 +1212,7 @@ ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
 	if (!EQ_CRC32C(record->xl_crc, crc))
 	{
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "incorrect resource manager data checksum in record at %X/%X",
 							  LSN_FORMAT_ARGS(recptr));
 		return false;
@@ -1223,6 +1247,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid magic number %04X in WAL segment %s, LSN %X/%X, offset %u",
 							  hdr->xlp_magic,
 							  fname,
@@ -1238,6 +1263,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 							  hdr->xlp_info,
 							  fname,
@@ -1254,6 +1280,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 			longhdr->xlp_sysid != state->system_identifier)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "WAL file is from different database system: WAL file database system identifier is %llu, pg_control database system identifier is %llu",
 								  (unsigned long long) longhdr->xlp_sysid,
 								  (unsigned long long) state->system_identifier);
@@ -1262,12 +1289,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		else if (longhdr->xlp_seg_size != state->segcxt.ws_segsize)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "WAL file is from different database system: incorrect segment size in page header");
 			return false;
 		}
 		else if (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "WAL file is from different database system: incorrect XLOG_BLCKSZ in page header");
 			return false;
 		}
@@ -1280,6 +1309,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 		/* hmm, first page of file doesn't have a long header? */
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 							  hdr->xlp_info,
 							  fname,
@@ -1300,6 +1330,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "unexpected pageaddr %X/%X in WAL segment %s, LSN %X/%X, offset %u",
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
@@ -1326,6 +1357,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 			XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "out-of-sequence timeline ID %u (after %u) in WAL segment %s, LSN %X/%X, offset %u",
 								  hdr->xlp_tli,
 								  state->latestPageTLI,
@@ -1349,6 +1381,7 @@ XLogReaderResetError(XLogReaderState *state)
 {
 	state->errormsg_buf[0] = '\0';
 	state->errormsg_deferred = false;
+	state->errorcode = XLOG_READER_NONE;
 }
 
 /*
@@ -1369,6 +1402,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
 	char	   *errormsg;
+	XLogReaderError	errorcode;
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
 
@@ -1453,7 +1487,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	 * or we just jumped over the remaining data of a continuation.
 	 */
 	XLogBeginRead(state, tmpRecPtr);
-	while (XLogReadRecord(state, &errormsg) != NULL)
+	while (XLogReadRecord(state, &errormsg, &errorcode) != NULL)
 	{
 		/* past the record we've found, break out */
 		if (RecPtr <= state->ReadRecPtr)
@@ -1600,6 +1634,7 @@ ResetDecoder(XLogReaderState *state)
 	/* Clear error state. */
 	state->errormsg_buf[0] = '\0';
 	state->errormsg_deferred = false;
+	state->errorcode = XLOG_READER_NONE;
 }
 
 /*
@@ -1732,6 +1767,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "out-of-order block_id %u at %X/%X",
 									  block_id,
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1756,6 +1792,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			if (blk->has_data && blk->data_len == 0)
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
 				goto err;
@@ -1763,6 +1800,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			if (!blk->has_data && blk->data_len != 0)
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "BKPBLOCK_HAS_DATA not set, but data length is %u at %X/%X",
 									  (unsigned int) blk->data_len,
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1799,6 +1837,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					 blk->bimg_len == BLCKSZ))
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPIMAGE_HAS_HOLE set, but hole offset %u length %u block image length %u at %X/%X",
 										  (unsigned int) blk->hole_offset,
 										  (unsigned int) blk->hole_length,
@@ -1815,6 +1854,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					(blk->hole_offset != 0 || blk->hole_length != 0))
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPIMAGE_HAS_HOLE not set, but hole offset %u length %u at %X/%X",
 										  (unsigned int) blk->hole_offset,
 										  (unsigned int) blk->hole_length,
@@ -1829,6 +1869,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					blk->bimg_len == BLCKSZ)
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPIMAGE_COMPRESSED set, but block image length %u at %X/%X",
 										  (unsigned int) blk->bimg_len,
 										  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1844,6 +1885,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					blk->bimg_len != BLCKSZ)
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "neither BKPIMAGE_HAS_HOLE nor BKPIMAGE_COMPRESSED set, but block image length is %u at %X/%X",
 										  (unsigned int) blk->data_len,
 										  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1860,6 +1902,7 @@ DecodeXLogRecord(XLogReaderState *state,
 				if (rlocator == NULL)
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPBLOCK_SAME_REL set but no previous rel at %X/%X",
 										  LSN_FORMAT_ARGS(state->ReadRecPtr));
 					goto err;
@@ -1872,6 +1915,7 @@ DecodeXLogRecord(XLogReaderState *state,
 		else
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "invalid block_id %u at %X/%X",
 								  block_id, LSN_FORMAT_ARGS(state->ReadRecPtr));
 			goto err;
@@ -1939,6 +1983,7 @@ DecodeXLogRecord(XLogReaderState *state,
 
 shortdata_err:
 	report_invalid_record(state,
+						  XLOG_READER_INVALID_DATA,
 						  "record with invalid length at %X/%X",
 						  LSN_FORMAT_ARGS(state->ReadRecPtr));
 err:
@@ -2049,6 +2094,7 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 		!record->record->blocks[block_id].in_use)
 	{
 		report_invalid_record(record,
+							  XLOG_READER_INVALID_DATA,
 							  "could not restore image at %X/%X with invalid block %d specified",
 							  LSN_FORMAT_ARGS(record->ReadRecPtr),
 							  block_id);
@@ -2056,7 +2102,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	}
 	if (!record->record->blocks[block_id].has_image)
 	{
-		report_invalid_record(record, "could not restore image at %X/%X with invalid state, block %d",
+		report_invalid_record(record,
+							  XLOG_READER_INVALID_DATA,
+							  "could not restore image at %X/%X with invalid state, block %d",
 							  LSN_FORMAT_ARGS(record->ReadRecPtr),
 							  block_id);
 		return false;
@@ -2083,7 +2131,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 									bkpb->bimg_len, BLCKSZ - bkpb->hole_length) <= 0)
 				decomp_success = false;
 #else
-			report_invalid_record(record, "could not restore image at %X/%X compressed with %s not supported by build, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not restore image at %X/%X compressed with %s not supported by build, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  "LZ4",
 								  block_id);
@@ -2100,7 +2150,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 			if (ZSTD_isError(decomp_result))
 				decomp_success = false;
 #else
-			report_invalid_record(record, "could not restore image at %X/%X compressed with %s not supported by build, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not restore image at %X/%X compressed with %s not supported by build, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  "zstd",
 								  block_id);
@@ -2109,7 +2161,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 		}
 		else
 		{
-			report_invalid_record(record, "could not restore image at %X/%X compressed with unknown method, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not restore image at %X/%X compressed with unknown method, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  block_id);
 			return false;
@@ -2117,7 +2171,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 
 		if (!decomp_success)
 		{
-			report_invalid_record(record, "could not decompress image at %X/%X, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not decompress image at %X/%X, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  block_id);
 			return false;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..06b00c7c46 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3063,8 +3063,9 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogReaderError	errorcode;
 
-		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
+		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg, &errorcode);
 		if (record == NULL)
 		{
 			/*
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 41243d0187..ea39bc5353 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -642,9 +642,10 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 	{
 		XLogRecord *record;
 		char	   *err = NULL;
+		XLogReaderError code;
 
 		/* the read_page callback waits for new WAL */
-		record = XLogReadRecord(ctx->reader, &err);
+		record = XLogReadRecord(ctx->reader, &err, &code);
 		if (err)
 			elog(ERROR, "could not find logical decoding starting point: %s", err);
 		if (!record)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 55a24c02c9..e411543111 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -245,8 +245,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		{
 			XLogRecord *record;
 			char	   *errm = NULL;
+			XLogReaderError errcode;
 
-			record = XLogReadRecord(ctx->reader, &errm);
+			record = XLogReadRecord(ctx->reader, &errm, &errcode);
 			if (errm)
 				elog(ERROR, "could not find record for logical decoding: %s", errm);
 
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6035cf4816..e09c641a0b 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -505,12 +505,13 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 		{
 			char	   *errm = NULL;
 			XLogRecord *record;
+			XLogReaderError errcode;
 
 			/*
 			 * Read records.  No changes are generated in fast_forward mode,
 			 * but snapbuilder/slot statuses are updated properly.
 			 */
-			record = XLogReadRecord(ctx->reader, &errm);
+			record = XLogReadRecord(ctx->reader, &errm, &errcode);
 			if (errm)
 				elog(ERROR, "could not find record while advancing replication slot: %s",
 					 errm);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d27ef2985d..39b10c7570 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3046,6 +3046,7 @@ XLogSendLogical(void)
 {
 	XLogRecord *record;
 	char	   *errm;
+	XLogReaderError errcode;
 
 	/*
 	 * We'll use the current flush point to determine whether we've caught up.
@@ -3063,7 +3064,7 @@ XLogSendLogical(void)
 	 */
 	WalSndCaughtUp = false;
 
-	record = XLogReadRecord(logical_decoding_ctx->reader, &errm);
+	record = XLogReadRecord(logical_decoding_ctx->reader, &errm, &errcode);
 
 	/* xlog record was invalid */
 	if (errm != NULL)
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 27782237d0..2fcbf6f904 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -69,6 +69,7 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
+	XLogReaderError errorcode;
 	XLogPageReadPrivate private;
 
 	private.tliIndex = tliIndex;
@@ -82,7 +83,7 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
 	XLogBeginRead(xlogreader, startpoint);
 	do
 	{
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogReadRecord(xlogreader, &errormsg, &errorcode);
 
 		if (record == NULL)
 		{
@@ -127,6 +128,7 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex,
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
+	XLogReaderError errorcode;
 	XLogPageReadPrivate private;
 	XLogRecPtr	endptr;
 
@@ -139,7 +141,7 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex,
 		pg_fatal("out of memory while allocating a WAL reading processor");
 
 	XLogBeginRead(xlogreader, ptr);
-	record = XLogReadRecord(xlogreader, &errormsg);
+	record = XLogReadRecord(xlogreader, &errormsg, &errorcode);
 	if (record == NULL)
 	{
 		if (errormsg)
@@ -174,6 +176,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 	XLogRecPtr	searchptr;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
+	XLogReaderError errorcode;
 	XLogPageReadPrivate private;
 
 	/*
@@ -204,7 +207,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 		uint8		info;
 
 		XLogBeginRead(xlogreader, searchptr);
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogReadRecord(xlogreader, &errormsg, &errorcode);
 
 		if (record == NULL)
 		{
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index e8b5a6cd61..cc38e3752f 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -797,6 +797,7 @@ main(int argc, char **argv)
 	XLogRecPtr	first_record;
 	char	   *waldir = NULL;
 	char	   *errormsg;
+	XLogReaderError errorcode;
 
 	static struct option long_options[] = {
 		{"bkp-details", no_argument, NULL, 'b'},
@@ -1239,7 +1240,7 @@ main(int argc, char **argv)
 		}
 
 		/* try to read the next record */
-		record = XLogReadRecord(xlogreader_state, &errormsg);
+		record = XLogReadRecord(xlogreader_state, &errormsg, &errorcode);
 		if (!record)
 		{
 			if (!config.follow || private.endptr_reached)
diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index 796a74f322..dea9837611 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -147,8 +147,9 @@ ReadNextXLogRecord(XLogReaderState *xlogreader)
 {
 	XLogRecord *record;
 	char	   *errormsg;
+	XLogReaderError errorcode;
 
-	record = XLogReadRecord(xlogreader, &errormsg);
+	record = XLogReadRecord(xlogreader, &errormsg, &errorcode);
 
 	if (record == NULL)
 	{
-- 
2.40.1

v1-0002-Force-a-FATAL-when-facing-OOM-in-WAL-replay.patch (text/x-diff; charset=us-ascii)
From c5c0c5d3b5fc43760da0fa4ed9466077af573056 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 8 Aug 2023 16:22:18 +0900
Subject: [PATCH v1 2/3] Force a FATAL when facing OOM in WAL replay

---
 src/backend/access/transam/xlogrecovery.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 06b00c7c46..e3b156ec15 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3098,9 +3098,15 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * failed - in that case we already logged something. In
 			 * StandbyMode that only happens if we have been triggered, so we
 			 * shouldn't loop anymore in that case.
+			 *
+			 * If we failed because of an out-of-memory problem, just give up
+			 * and retry recovery later.  It may be possible that the WAL record
+			 * to decode required a larger memory allocation than what the host
+			 * can offer.
 			 */
 			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+				ereport(errorcode == XLOG_READER_OOM ?
+						FATAL : emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
-- 
2.40.1

v1-0003-Tweak-to-force-OOM-behavior-when-replaying-record.patch (text/x-diff; charset=us-ascii)
From a8621af2716d92133d91ae372777743643f9b3be Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 1 Aug 2023 11:49:53 +0900
Subject: [PATCH v1 3/3] Tweak to force OOM behavior when replaying records

---
 src/backend/access/transam/xlogreader.c | 27 ++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 8fbf2d3513..b9dc5a780b 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -556,6 +556,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	int			readOff;
 	DecodedXLogRecord *decoded;
 	char	   *errormsg;		/* not used */
+	bool		trigger_oom = false;
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -705,7 +706,31 @@ restart:
 	decoded = XLogReadRecordAlloc(state,
 								  total_len,
 								  !nonblocking /* allow_oversized */ );
-	if (decoded == NULL)
+
+#ifndef FRONTEND
+	/*
+	 * Trick to emulate an OOM after a hardcoded number of records
+	 * replayed.
+	 */
+	{
+		struct stat fstat;
+		static int counter = 0;
+
+		if (stat("/tmp/xlogreader_oom", &fstat) == 0)
+		{
+			counter++;
+			if (counter >= 100)
+			{
+				trigger_oom = true;
+
+				/* Reset counter, to not fail when shutting down WAL */
+				counter = 0;
+			}
+		}
+	}
+#endif
+
+	if (decoded == NULL || trigger_oom)
 	{
 		/*
 		 * There is no space in the decode buffer.  The caller should help
-- 
2.40.1

#11 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#10)
Re: Incorrect handling of OOM in WAL replay leading to data loss

At Tue, 8 Aug 2023 16:29:49 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Wed, Aug 02, 2023 at 01:16:02PM +0900, Kyotaro Horiguchi wrote:

I believe this approach is sufficient to determine whether the error
is OOM or not. If total_len is corrupted and has an excessively large
value, it's highly unlikely that all subsequent pages for that length
will be consistent.

Do you have any thoughts on this?

This could be more flexible IMO, and actually in some cases
errormsg_fatal may be eaten if using the WAL prefetcher as the error
message is reported with the caller of XLogPrefetcherReadRecord(), no?

Right. The goal of my PoC was to detect OOM accurately or at least
sufficiently so. We need to separately pass the "error code" along
with the message to make it work with the prefetcher. We could
enclose errormsg and errorcode in a struct.

Anything that has been discussed on this thread now involves a change
in XLogReaderState that induces an ABI breakage. For HEAD, we are
likely going in this direction, but if we are going to bite the bullet
we'd better be a bit more aggressive with the integration and report
an error code side-by-side with the error message returned by
XLogPrefetcherReadRecord(), XLogReadRecord() and XLogNextRecord() so
that all of the callers can decide what they want to do on an invalid
record or just an OOM.

Sounds reasonable.

Attached is the idea of infrastructure I have in mind, as of 0001,
where this adds an error code to report_invalid_record(). For now
this includes three error codes appended to the error messages
generated that can be expanded if need be: no error, OOM and invalid
data. The invalid data part may need to be much more verbose, and
could be improved to make this stuff "less scary" as the other thread
proposes, but what I have here would be enough to trigger a different
decision in the startup process if a record cannot be fetched on OOM
or if there's a different reason behind that.

Agreed. This clarifies the basis for decisions at the upper layer
(ReadRecord) and adds flexibility.
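
The reporting side discussed above can be sketched in isolation. This is
only an illustration under stated assumptions, not the code of 0001: the
Sketch* names, the fixed-size message buffer and the reset helper are
hypothetical stand-ins for XLogReaderState and report_invalid_record().
It shows the error code being recorded next to the formatted message, and
both being cleared together.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical error codes, mirroring the three codes described in 0001. */
typedef enum SketchErrorCode
{
	SKETCH_READER_NONE = 0,
	SKETCH_READER_OOM,
	SKETCH_READER_INVALID_DATA
} SketchErrorCode;

#define SKETCH_ERRMSG_LEN 256

/* Hypothetical, reduced stand-in for the error fields of a WAL reader. */
typedef struct SketchReader
{
	char		errormsg_buf[SKETCH_ERRMSG_LEN];
	bool		errormsg_deferred;
	SketchErrorCode errorcode;
} SketchReader;

/* Format the message and remember why the read failed. */
void
sketch_report_invalid_record(SketchReader *state, SketchErrorCode errorcode,
							 const char *fmt,...)
{
	va_list		args;

	va_start(args, fmt);
	vsnprintf(state->errormsg_buf, SKETCH_ERRMSG_LEN, fmt, args);
	va_end(args);

	state->errormsg_deferred = true;
	state->errorcode = errorcode;
}

/* Clear message and code together, so that no stale code survives. */
void
sketch_reset_error(SketchReader *state)
{
	state->errormsg_buf[0] = '\0';
	state->errormsg_deferred = false;
	state->errorcode = SKETCH_READER_NONE;
}
```

Resetting the code in the same place as the message matters: a code left
over from a previous failure would otherwise be attributed to a later,
unrelated read.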

0002 is an example of a decision that can be taken in WAL replay if we
see an OOM, based on the error code received. One argument is that we
may want to improve emode_for_corrupt_record() so that it reacts better
on OOM, upgrading the emode wanted, but this needs more analysis
depending on the code path involved.

0003 is my previous trick to inject an OOM failure at replay. Reusing
the previous script, this would be enough to prevent redo from ending
early and causing a loss of data.

Note that we have a few other things going in the tree. As one
example, pg_walinspect would consider an OOM as the end of WAL. Not
critical, still slightly incorrect as the end of WAL may not have been
reached yet so it can report some incorrect information depending on
what the WAL reader faces. This could be improved with the additions
of 0001.

Thoughts or comments?

I like the overall direction. Though, I'm considering enclosing the
errormsg and errorcode in a struct.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12 Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#11)
3 attachment(s)
Re: Incorrect handling of OOM in WAL replay leading to data loss

On Tue, Aug 08, 2023 at 05:44:03PM +0900, Kyotaro Horiguchi wrote:

I like the overall direction. Though, I'm considering enclosing the
errormsg and errorcode in a struct.

Yes, this suggestion makes sense as it simplifies all the WAL routines
that need to report back a complete error state, and there are four of
them now:
XLogPrefetcherReadRecord()
XLogReadRecord()
XLogNextRecord()
DecodeXLogRecord()

I have spent more time on 0001, polishing it and fixing a few bugs
that I have found while reviewing the whole.  Most of them were
related to mistakes in resetting the error state when expected.  I
have also expanded DecodeXLogRecord() to use an error structure
instead of only an errmsg, giving more consistency.  The error state
now relies on two structures:
+typedef enum XLogReaderErrorCode
+{
+   XLOG_READER_NONE = 0,
+   XLOG_READER_OOM,            /* out-of-memory */
+   XLOG_READER_INVALID_DATA,   /* record data */
+} XLogReaderErrorCode;
+typedef struct XLogReaderError
+{
+   /* Buffer to hold error message */
+   char       *message;
+   bool        message_deferred;
+   /* Error code when filling *message */
+   XLogReaderErrorCode code;
+} XLogReaderError;

I'm kind of happy with this layer, now.
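
As a rough illustration of what a caller could do with this layer, here
is a minimal sketch. The enum and the struct are copied from the
definitions above; SketchReadOutcome and sketch_classify_read() are
hypothetical names that are not part of the patch, and the record is
passed as an opaque pointer to keep the example self-contained.

```c
#include <stdbool.h>
#include <stddef.h>

/* Copied from the definitions quoted above. */
typedef enum XLogReaderErrorCode
{
	XLOG_READER_NONE = 0,
	XLOG_READER_OOM,			/* out-of-memory */
	XLOG_READER_INVALID_DATA,	/* record data */
} XLogReaderErrorCode;

typedef struct XLogReaderError
{
	/* Buffer to hold error message */
	char	   *message;
	bool		message_deferred;
	/* Error code when filling *message */
	XLogReaderErrorCode code;
} XLogReaderError;

/* Hypothetical outcomes a caller may want to tell apart. */
typedef enum SketchReadOutcome
{
	SKETCH_GOT_RECORD,			/* a record was returned */
	SKETCH_NO_MORE_DATA,		/* no record and no error reported */
	SKETCH_RETRY_LATER,			/* OOM: the WAL itself may be fine */
	SKETCH_INVALID_DATA			/* caller applies its end-of-WAL policy */
} SketchReadOutcome;

/* Map the result of one read attempt to an outcome. */
SketchReadOutcome
sketch_classify_read(const void *record, const XLogReaderError *errordata)
{
	if (record != NULL)
		return SKETCH_GOT_RECORD;

	switch (errordata->code)
	{
		case XLOG_READER_OOM:
			return SKETCH_RETRY_LATER;
		case XLOG_READER_INVALID_DATA:
			return SKETCH_INVALID_DATA;
		case XLOG_READER_NONE:
			break;
	}
	return SKETCH_NO_MORE_DATA;
}
```

The point is only that an OOM no longer has to be folded into the same
branch as invalid data, which is what currently makes it look like the
end of WAL.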

I have also spent some time on finding a more elegant solution for the
WAL replay, relying on the new facility from 0001. And it happens
that it is easy enough to loop if facing an out-of-memory failure when
reading a record when we are in crash recovery, as the state is
actually close to what a standby does. The trick is that we should
not change the state and avoid tracking a continuation record. This
is done in 0002, making replay more robust. With the addition of the
error injection tweak in 0003, I am able to finish recovery while the
startup process loops if under memory pressure. As mentioned
previously, there are more code paths to consider, but that's a start
to fix the data loss problems.
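
The looping idea can be sketched outside of the backend as follows. This
is a simplified illustration, not what 0002 does: all the Sketch* names
are hypothetical, the read attempt is abstracted as a callback, and the
bound on the number of attempts is an assumption added so that the
example terminates (the startup process in 0002 simply keeps looping).

```c
#include <stddef.h>

/* Hypothetical result of one read attempt. */
typedef enum SketchStatus
{
	SKETCH_OK,
	SKETCH_OOM,
	SKETCH_INVALID
} SketchStatus;

typedef SketchStatus (*SketchReadFn) (void *arg);

/*
 * Retry a read for as long as it fails on OOM, up to max_attempts.
 * Anything other than OOM, success included, ends the loop at once.
 */
SketchStatus
sketch_read_with_retry(SketchReadFn read_fn, void *arg,
					   int max_attempts, int *attempts_used)
{
	SketchStatus status = SKETCH_OOM;
	int			attempts = 0;

	while (attempts < max_attempts)
	{
		attempts++;
		status = read_fn(arg);
		if (status != SKETCH_OOM)
			break;

		/*
		 * A real implementation would wait here, and would retry from the
		 * same position without changing the reader state.
		 */
	}

	if (attempts_used != NULL)
		*attempts_used = attempts;
	return status;
}

/* Fake reader failing on OOM a fixed number of times, then succeeding. */
typedef struct SketchFakeReader
{
	int			oom_failures_left;
} SketchFakeReader;

SketchStatus
sketch_fake_read(void *arg)
{
	SketchFakeReader *reader = arg;

	if (reader->oom_failures_left > 0)
	{
		reader->oom_failures_left--;
		return SKETCH_OOM;
	}
	return SKETCH_OK;
}
```

With a fake reader that fails three times, a call allowing ten attempts
succeeds on the fourth one, while a call allowing only two attempts
against five pending failures still reports the OOM to its caller.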

Comments are welcome.
--
Michael

Attachments:

v2-0001-Add-infrastructure-to-report-error-codes-in-WAL-r.patch (text/x-diff; charset=us-ascii)
From 9da3751ef46be90cf7a5d4050c5cb281c33dbf70 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 9 Aug 2023 13:47:30 +0900
Subject: [PATCH v2 1/3] Add infrastructure to report error codes in WAL reader

This commit moves the error state coming from WAL readers into a new
structure that includes the existing pointer to the error message
buffer and gains an error code that is fed back to the callers of
the following routines:
XLogPrefetcherReadRecord()
XLogReadRecord()
XLogNextRecord()
DecodeXLogRecord()

This will help in improving the decisions to take during recovery
depending on the failure mode reported.
---
 src/include/access/xlogprefetcher.h           |   2 +-
 src/include/access/xlogreader.h               |  33 +++-
 src/backend/access/transam/twophase.c         |   8 +-
 src/backend/access/transam/xlog.c             |   6 +-
 src/backend/access/transam/xlogprefetcher.c   |   4 +-
 src/backend/access/transam/xlogreader.c       | 167 ++++++++++++------
 src/backend/access/transam/xlogrecovery.c     |  14 +-
 src/backend/access/transam/xlogutils.c        |   2 +-
 src/backend/replication/logical/logical.c     |   9 +-
 .../replication/logical/logicalfuncs.c        |   9 +-
 src/backend/replication/slotfuncs.c           |   8 +-
 src/backend/replication/walsender.c           |   8 +-
 src/bin/pg_rewind/parsexlog.c                 |  24 +--
 src/bin/pg_waldump/pg_waldump.c               |  10 +-
 contrib/pg_walinspect/pg_walinspect.c         |  11 +-
 src/tools/pgindent/typedefs.list              |   2 +
 16 files changed, 200 insertions(+), 117 deletions(-)

diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
index 7dd7f20ad0..5563ad1a67 100644
--- a/src/include/access/xlogprefetcher.h
+++ b/src/include/access/xlogprefetcher.h
@@ -48,7 +48,7 @@ extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
 									XLogRecPtr recPtr);
 
 extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
-											char **errmsg);
+											XLogReaderError *errordata);
 
 extern void XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index da32c7db77..2b57b5eb01 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -58,6 +58,25 @@ typedef struct WALSegmentContext
 
 typedef struct XLogReaderState XLogReaderState;
 
+/* Values for XLogReaderError.code */
+typedef enum XLogReaderErrorCode
+{
+	XLOG_READER_NONE = 0,
+	XLOG_READER_OOM,			/* out-of-memory */
+	XLOG_READER_INVALID_DATA,	/* record data */
+} XLogReaderErrorCode;
+
+/* Error status generated by a WAL reader on failure */
+typedef struct XLogReaderError
+{
+	/* Buffer to hold error message */
+	char	   *message;
+	bool		message_deferred;
+	/* Error code when filling *message */
+	XLogReaderErrorCode code;
+} XLogReaderError;
+
+
 /* Function type definitions for various xlogreader interactions */
 typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
@@ -307,9 +326,8 @@ struct XLogReaderState
 	char	   *readRecordBuf;
 	uint32		readRecordBufSize;
 
-	/* Buffer to hold error message */
-	char	   *errormsg_buf;
-	bool		errormsg_deferred;
+	/* Error state data */
+	XLogReaderError errordata;
 
 	/*
 	 * Flag to indicate to XLogPageReadCB that it should not block waiting for
@@ -324,7 +342,8 @@ struct XLogReaderState
 static inline bool
 XLogReaderHasQueuedRecordOrError(XLogReaderState *state)
 {
-	return (state->decode_queue_head != NULL) || state->errormsg_deferred;
+	return (state->decode_queue_head != NULL) ||
+		state->errordata.message_deferred;
 }
 
 /* Get a new XLogReader */
@@ -355,11 +374,11 @@ typedef enum XLogPageReadResult
 
 /* Read the next XLog record. Returns NULL on end-of-WAL or failure */
 extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
-										 char **errormsg);
+										 XLogReaderError *errordata);
 
 /* Consume the next record or error. */
 extern DecodedXLogRecord *XLogNextRecord(XLogReaderState *state,
-										 char **errormsg);
+										 XLogReaderError *errordata);
 
 /* Release the previously returned record, if necessary. */
 extern XLogRecPtr XLogReleasePreviousRecord(XLogReaderState *state);
@@ -399,7 +418,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
 							 DecodedXLogRecord *decoded,
 							 XLogRecord *record,
 							 XLogRecPtr lsn,
-							 char **errormsg);
+							 XLogReaderError *errordata);
 
 /*
  * Macros that provide access to parts of the record most recently returned by
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index c6af8cfd7e..08bd6586ec 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1399,7 +1399,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
 	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
 									XL_ROUTINE(.page_read = &read_local_xlog_page,
@@ -1413,15 +1413,15 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 
 	XLogBeginRead(xlogreader, lsn);
-	record = XLogReadRecord(xlogreader, &errormsg);
+	record = XLogReadRecord(xlogreader, &errordata);
 
 	if (record == NULL)
 	{
-		if (errormsg)
+		if (errordata.message)
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read two-phase state from WAL at %X/%X: %s",
-							LSN_FORMAT_ARGS(lsn), errormsg)));
+							LSN_FORMAT_ARGS(lsn), errordata.message)));
 		else
 			ereport(ERROR,
 					(errcode_for_file_access(),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 60c0b7ec3a..d16acd5e49 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -953,7 +953,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		DecodedXLogRecord *decoded;
 		StringInfoData buf;
 		StringInfoData recordBuf;
-		char	   *errormsg = NULL;
+		XLogReaderError	errordata = {0};
 		MemoryContext oldCxt;
 
 		oldCxt = MemoryContextSwitchTo(walDebugCxt);
@@ -987,10 +987,10 @@ XLogInsertRecord(XLogRecData *rdata,
 								   decoded,
 								   record,
 								   EndPos,
-								   &errormsg))
+								   &errordata))
 		{
 			appendStringInfo(&buf, "error decoding record: %s",
-							 errormsg ? errormsg : "no error message");
+							 errordata.message ? errordata.message : "no error message");
 		}
 		else
 		{
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 539928cb85..92d691ca49 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -984,7 +984,7 @@ XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher, XLogRecPtr recPtr)
  * tries to initiate I/O for blocks referenced in future WAL records.
  */
 XLogRecord *
-XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
+XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, XLogReaderError *errdata)
 {
 	DecodedXLogRecord *record;
 	XLogRecPtr	replayed_up_to;
@@ -1052,7 +1052,7 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	}
 
 	/* Read the next record. */
-	record = XLogNextRecord(prefetcher->reader, errmsg);
+	record = XLogNextRecord(prefetcher->reader, errdata);
 	if (!record)
 		return NULL;
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index c9f9f6e98f..891e38e6e7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -41,8 +41,10 @@
 #include "common/logging.h"
 #endif
 
-static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
-			pg_attribute_printf(2, 3);
+static void report_invalid_record(XLogReaderState *state,
+								  XLogReaderErrorCode errorcode,
+								  const char *fmt,...)
+			pg_attribute_printf(3, 4);
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
@@ -66,21 +68,23 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 #define DEFAULT_DECODE_BUFFER_SIZE (64 * 1024)
 
 /*
- * Construct a string in state->errormsg_buf explaining what's wrong with
+ * Construct a string in state->errordata.message explaining what's wrong with
  * the current record being read.
  */
 static void
-report_invalid_record(XLogReaderState *state, const char *fmt,...)
+report_invalid_record(XLogReaderState *state, XLogReaderErrorCode errorcode,
+					  const char *fmt,...)
 {
 	va_list		args;
 
 	fmt = _(fmt);
 
 	va_start(args, fmt);
-	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
+	vsnprintf(state->errordata.message, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
 
-	state->errormsg_deferred = true;
+	state->errordata.message_deferred = true;
+	state->errordata.code = errorcode;
 }
 
 /*
@@ -141,15 +145,16 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* system_identifier initialized to zeroes above */
 	state->private_data = private_data;
 	/* ReadRecPtr, EndRecPtr and readLen initialized to zeroes above */
-	state->errormsg_buf = palloc_extended(MAX_ERRORMSG_LEN + 1,
-										  MCXT_ALLOC_NO_OOM);
-	if (!state->errormsg_buf)
+	state->errordata.message = palloc_extended(MAX_ERRORMSG_LEN + 1,
+											   MCXT_ALLOC_NO_OOM);
+	if (!state->errordata.message)
 	{
 		pfree(state->readBuf);
 		pfree(state);
 		return NULL;
 	}
-	state->errormsg_buf[0] = '\0';
+	state->errordata.message[0] = '\0';
+	state->errordata.code = XLOG_READER_NONE;
 
 	/*
 	 * Allocate an initial readRecordBuf of minimal size, which can later be
@@ -157,7 +162,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	 */
 	if (!allocate_recordbuf(state, 0))
 	{
-		pfree(state->errormsg_buf);
+		pfree(state->errordata.message);
 		pfree(state->readBuf);
 		pfree(state);
 		return NULL;
@@ -175,7 +180,7 @@ XLogReaderFree(XLogReaderState *state)
 	if (state->decode_buffer && state->free_decode_buffer)
 		pfree(state->decode_buffer);
 
-	pfree(state->errormsg_buf);
+	pfree(state->errordata.message);
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
 	pfree(state->readBuf);
@@ -351,23 +356,27 @@ XLogReleasePreviousRecord(XLogReaderState *state)
  *
  * On success, a record is returned.
  *
- * The returned record (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogNextRecord.
+ * The returned record (or errordata->message) points to an internal buffer
+ * that's valid until the next call to XLogNextRecord.
  */
 DecodedXLogRecord *
-XLogNextRecord(XLogReaderState *state, char **errormsg)
+XLogNextRecord(XLogReaderState *state, XLogReaderError *errordata)
 {
 	/* Release the last record returned by XLogNextRecord(). */
 	XLogReleasePreviousRecord(state);
 
 	if (state->decode_queue_head == NULL)
 	{
-		*errormsg = NULL;
-		if (state->errormsg_deferred)
+		errordata->message = NULL;
+		errordata->code = XLOG_READER_NONE;
+		if (state->errordata.message_deferred)
 		{
-			if (state->errormsg_buf[0] != '\0')
-				*errormsg = state->errormsg_buf;
-			state->errormsg_deferred = false;
+			if (state->errordata.message[0] != '\0')
+				errordata->message = state->errordata.message;
+			if (state->errordata.code != XLOG_READER_NONE)
+				errordata->code = state->errordata.code;
+			state->errordata.message_deferred = false;
+			state->errordata.code = XLOG_READER_NONE;
 		}
 
 		/*
@@ -397,7 +406,8 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
 	state->ReadRecPtr = state->record->lsn;
 	state->EndRecPtr = state->record->next_lsn;
 
-	*errormsg = NULL;
+	errordata->message = NULL;
+	errordata->code = XLOG_READER_NONE;
 
 	return state->record;
 }
@@ -409,17 +419,17 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
  * to XLogReadRecord().
  *
  * If the page_read callback fails to read the requested data, NULL is
- * returned.  The callback is expected to have reported the error; errormsg
- * is set to NULL.
+ * returned.  The callback is expected to have reported the error;
+ * errordata->message is set to NULL.
  *
  * If the reading fails for some other reason, NULL is also returned, and
- * *errormsg is set to a string with details of the failure.
+ * *errordata is set with details of the failure.
  *
- * The returned pointer (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogReadRecord.
+ * The returned pointer (or errordata->message) points to an internal
+ * buffer that's valid until the next call to XLogReadRecord.
  */
 XLogRecord *
-XLogReadRecord(XLogReaderState *state, char **errormsg)
+XLogReadRecord(XLogReaderState *state, XLogReaderError *errordata)
 {
 	DecodedXLogRecord *decoded;
 
@@ -437,7 +447,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		XLogReadAhead(state, false /* nonblocking */ );
 
 	/* Consume the head record or error. */
-	decoded = XLogNextRecord(state, errormsg);
+	decoded = XLogNextRecord(state, errordata);
 	if (decoded)
 	{
 		/*
@@ -546,7 +556,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	bool		gotheader;
 	int			readOff;
 	DecodedXLogRecord *decoded;
-	char	   *errormsg;		/* not used */
+	XLogReaderError errordata = {0};		/* not used */
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -556,7 +566,8 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	randAccess = false;
 
 	/* reset error state */
-	state->errormsg_buf[0] = '\0';
+	state->errordata.message[0] = '\0';
+	state->errordata.code = XLOG_READER_NONE;
 	decoded = NULL;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -623,7 +634,9 @@ restart:
 	}
 	else if (targetRecOff < pageHeaderSize)
 	{
-		report_invalid_record(state, "invalid record offset at %X/%X: expected at least %u, got %u",
+		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
+							  "invalid record offset at %X/%X: expected at least %u, got %u",
 							  LSN_FORMAT_ARGS(RecPtr),
 							  pageHeaderSize, targetRecOff);
 		goto err;
@@ -632,7 +645,9 @@ restart:
 	if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
 		targetRecOff == pageHeaderSize)
 	{
-		report_invalid_record(state, "contrecord is requested by %X/%X",
+		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
+							  "contrecord is requested by %X/%X",
 							  LSN_FORMAT_ARGS(RecPtr));
 		goto err;
 	}
@@ -673,6 +688,7 @@ restart:
 		if (total_len < SizeOfXLogRecord)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "invalid record length at %X/%X: expected at least %u, got %u",
 								  LSN_FORMAT_ARGS(RecPtr),
 								  (uint32) SizeOfXLogRecord, total_len);
@@ -691,6 +707,7 @@ restart:
 	decoded = XLogReadRecordAlloc(state,
 								  total_len,
 								  !nonblocking /* allow_oversized */ );
+
 	if (decoded == NULL)
 	{
 		/*
@@ -702,6 +719,7 @@ restart:
 
 		/* We failed to allocate memory for an oversized record. */
 		report_invalid_record(state,
+							  XLOG_READER_OOM,
 							  "out of memory while trying to decode a record of length %u", total_len);
 		goto err;
 	}
@@ -724,7 +742,9 @@ restart:
 			!allocate_recordbuf(state, total_len))
 		{
 			/* We treat this as a "bogus data" condition */
-			report_invalid_record(state, "record length %u at %X/%X too long",
+			report_invalid_record(state,
+								  XLOG_READER_OOM,
+								  "record length %u at %X/%X too long",
 								  total_len, LSN_FORMAT_ARGS(RecPtr));
 			goto err;
 		}
@@ -773,6 +793,7 @@ restart:
 			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "there is no contrecord flag at %X/%X",
 									  LSN_FORMAT_ARGS(RecPtr));
 				goto err;
@@ -786,6 +807,7 @@ restart:
 				total_len != (pageHeader->xlp_rem_len + gotlen))
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "invalid contrecord length %u (expected %lld) at %X/%X",
 									  pageHeader->xlp_rem_len,
 									  ((long long) total_len) - gotlen,
@@ -867,7 +889,7 @@ restart:
 		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
 	}
 
-	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
+	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errordata))
 	{
 		/* Record the location of the next record. */
 		decoded->next_lsn = state->NextRecPtr;
@@ -918,7 +940,7 @@ err:
 		 * queued so that XLogPrefetcherReadRecord() doesn't bring us back a
 		 * second time and clobber the above state.
 		 */
-		state->errormsg_deferred = true;
+		state->errordata.message_deferred = true;
 	}
 
 	if (decoded && decoded->oversized)
@@ -931,9 +953,9 @@ err:
 	XLogReaderInvalReadState(state);
 
 	/*
-	 * If an error was written to errmsg_buf, it'll be returned to the caller
-	 * of XLogReadRecord() after all successfully decoded records from the
-	 * read queue.
+	 * If an error was written to errordata.message, it'll be returned to the
+	 * caller of XLogReadRecord() after all successfully decoded records from
+	 * the read queue.
 	 */
 
 	return XLREAD_FAIL;
@@ -952,7 +974,7 @@ XLogReadAhead(XLogReaderState *state, bool nonblocking)
 {
 	XLogPageReadResult result;
 
-	if (state->errormsg_deferred)
+	if (state->errordata.message_deferred)
 		return NULL;
 
 	result = XLogDecodeNextRecord(state, nonblocking);
@@ -970,8 +992,8 @@ XLogReadAhead(XLogReaderState *state, bool nonblocking)
  * via the page_read() callback.
  *
  * Returns XLREAD_FAIL if the required page cannot be read for some
- * reason; errormsg_buf is set in that case (unless the error occurs in the
- * page_read callback).
+ * reason; errordata.message is set in that case (unless the error occurs in
+ * the page_read callback).
  *
  * Returns XLREAD_WOULDBLOCK if the requested data can't be read without
  * waiting.  This can be returned only if the installed page_read callback
@@ -1116,6 +1138,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid record length at %X/%X: expected at least %u, got %u",
 							  LSN_FORMAT_ARGS(RecPtr),
 							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
@@ -1124,6 +1147,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid resource manager ID %u at %X/%X",
 							  record->xl_rmid, LSN_FORMAT_ARGS(RecPtr));
 		return false;
@@ -1137,6 +1161,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 		if (!(record->xl_prev < RecPtr))
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "record with incorrect prev-link %X/%X at %X/%X",
 								  LSN_FORMAT_ARGS(record->xl_prev),
 								  LSN_FORMAT_ARGS(RecPtr));
@@ -1153,6 +1178,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 		if (record->xl_prev != PrevRecPtr)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "record with incorrect prev-link %X/%X at %X/%X",
 								  LSN_FORMAT_ARGS(record->xl_prev),
 								  LSN_FORMAT_ARGS(RecPtr));
@@ -1189,6 +1215,7 @@ ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
 	if (!EQ_CRC32C(record->xl_crc, crc))
 	{
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "incorrect resource manager data checksum in record at %X/%X",
 							  LSN_FORMAT_ARGS(recptr));
 		return false;
@@ -1223,6 +1250,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid magic number %04X in WAL segment %s, LSN %X/%X, offset %u",
 							  hdr->xlp_magic,
 							  fname,
@@ -1238,6 +1266,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 							  hdr->xlp_info,
 							  fname,
@@ -1254,6 +1283,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 			longhdr->xlp_sysid != state->system_identifier)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "WAL file is from different database system: WAL file database system identifier is %llu, pg_control database system identifier is %llu",
 								  (unsigned long long) longhdr->xlp_sysid,
 								  (unsigned long long) state->system_identifier);
@@ -1262,12 +1292,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		else if (longhdr->xlp_seg_size != state->segcxt.ws_segsize)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "WAL file is from different database system: incorrect segment size in page header");
 			return false;
 		}
 		else if (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "WAL file is from different database system: incorrect XLOG_BLCKSZ in page header");
 			return false;
 		}
@@ -1280,6 +1312,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 		/* hmm, first page of file doesn't have a long header? */
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 							  hdr->xlp_info,
 							  fname,
@@ -1300,6 +1333,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "unexpected pageaddr %X/%X in WAL segment %s, LSN %X/%X, offset %u",
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
@@ -1326,6 +1360,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 			XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "out-of-sequence timeline ID %u (after %u) in WAL segment %s, LSN %X/%X, offset %u",
 								  hdr->xlp_tli,
 								  state->latestPageTLI,
@@ -1347,8 +1382,9 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 void
 XLogReaderResetError(XLogReaderState *state)
 {
-	state->errormsg_buf[0] = '\0';
-	state->errormsg_deferred = false;
+	state->errordata.message[0] = '\0';
+	state->errordata.message_deferred = false;
+	state->errordata.code = XLOG_READER_NONE;
 }
 
 /*
@@ -1368,7 +1404,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	XLogRecPtr	tmpRecPtr;
 	XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
 
@@ -1453,7 +1489,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	 * or we just jumped over the remaining data of a continuation.
 	 */
 	XLogBeginRead(state, tmpRecPtr);
-	while (XLogReadRecord(state, &errormsg) != NULL)
+	while (XLogReadRecord(state, &errordata) != NULL)
 	{
 		/* past the record we've found, break out */
 		if (RecPtr <= state->ReadRecPtr)
@@ -1598,8 +1634,9 @@ ResetDecoder(XLogReaderState *state)
 	state->decode_buffer_head = state->decode_buffer;
 
 	/* Clear error state. */
-	state->errormsg_buf[0] = '\0';
-	state->errormsg_deferred = false;
+	state->errordata.message[0] = '\0';
+	state->errordata.message_deferred = false;
+	state->errordata.code = XLOG_READER_NONE;
 }
 
 /*
@@ -1649,7 +1686,7 @@ DecodeXLogRecord(XLogReaderState *state,
 				 DecodedXLogRecord *decoded,
 				 XLogRecord *record,
 				 XLogRecPtr lsn,
-				 char **errormsg)
+				 XLogReaderError *errordata)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1732,6 +1769,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "out-of-order block_id %u at %X/%X",
 									  block_id,
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1756,6 +1794,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			if (blk->has_data && blk->data_len == 0)
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
 				goto err;
@@ -1763,6 +1802,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			if (!blk->has_data && blk->data_len != 0)
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "BKPBLOCK_HAS_DATA not set, but data length is %u at %X/%X",
 									  (unsigned int) blk->data_len,
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1799,6 +1839,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					 blk->bimg_len == BLCKSZ))
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPIMAGE_HAS_HOLE set, but hole offset %u length %u block image length %u at %X/%X",
 										  (unsigned int) blk->hole_offset,
 										  (unsigned int) blk->hole_length,
@@ -1815,6 +1856,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					(blk->hole_offset != 0 || blk->hole_length != 0))
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPIMAGE_HAS_HOLE not set, but hole offset %u length %u at %X/%X",
 										  (unsigned int) blk->hole_offset,
 										  (unsigned int) blk->hole_length,
@@ -1829,6 +1871,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					blk->bimg_len == BLCKSZ)
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPIMAGE_COMPRESSED set, but block image length %u at %X/%X",
 										  (unsigned int) blk->bimg_len,
 										  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1844,6 +1887,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					blk->bimg_len != BLCKSZ)
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "neither BKPIMAGE_HAS_HOLE nor BKPIMAGE_COMPRESSED set, but block image length is %u at %X/%X",
 										  (unsigned int) blk->data_len,
 										  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1860,6 +1904,7 @@ DecodeXLogRecord(XLogReaderState *state,
 				if (rlocator == NULL)
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPBLOCK_SAME_REL set but no previous rel at %X/%X",
 										  LSN_FORMAT_ARGS(state->ReadRecPtr));
 					goto err;
@@ -1872,6 +1917,7 @@ DecodeXLogRecord(XLogReaderState *state,
 		else
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "invalid block_id %u at %X/%X",
 								  block_id, LSN_FORMAT_ARGS(state->ReadRecPtr));
 			goto err;
@@ -1939,10 +1985,12 @@ DecodeXLogRecord(XLogReaderState *state,
 
 shortdata_err:
 	report_invalid_record(state,
+						  XLOG_READER_INVALID_DATA,
 						  "record with invalid length at %X/%X",
 						  LSN_FORMAT_ARGS(state->ReadRecPtr));
 err:
-	*errormsg = state->errormsg_buf;
+	errordata->message = state->errordata.message;
+	errordata->code = state->errordata.code;
 
 	return false;
 }
@@ -2049,6 +2097,7 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 		!record->record->blocks[block_id].in_use)
 	{
 		report_invalid_record(record,
+							  XLOG_READER_INVALID_DATA,
 							  "could not restore image at %X/%X with invalid block %d specified",
 							  LSN_FORMAT_ARGS(record->ReadRecPtr),
 							  block_id);
@@ -2056,7 +2105,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	}
 	if (!record->record->blocks[block_id].has_image)
 	{
-		report_invalid_record(record, "could not restore image at %X/%X with invalid state, block %d",
+		report_invalid_record(record,
+							  XLOG_READER_INVALID_DATA,
+							  "could not restore image at %X/%X with invalid state, block %d",
 							  LSN_FORMAT_ARGS(record->ReadRecPtr),
 							  block_id);
 		return false;
@@ -2083,7 +2134,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 									bkpb->bimg_len, BLCKSZ - bkpb->hole_length) <= 0)
 				decomp_success = false;
 #else
-			report_invalid_record(record, "could not restore image at %X/%X compressed with %s not supported by build, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not restore image at %X/%X compressed with %s not supported by build, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  "LZ4",
 								  block_id);
@@ -2100,7 +2153,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 			if (ZSTD_isError(decomp_result))
 				decomp_success = false;
 #else
-			report_invalid_record(record, "could not restore image at %X/%X compressed with %s not supported by build, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not restore image at %X/%X compressed with %s not supported by build, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  "zstd",
 								  block_id);
@@ -2109,7 +2164,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 		}
 		else
 		{
-			report_invalid_record(record, "could not restore image at %X/%X compressed with unknown method, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not restore image at %X/%X compressed with unknown method, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  block_id);
 			return false;
@@ -2117,7 +2174,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 
 		if (!decomp_success)
 		{
-			report_invalid_record(record, "could not decompress image at %X/%X, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not decompress image at %X/%X, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  block_id);
 			return false;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..68100bfa4a 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2454,7 +2454,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
 		if (!RestoreBlockImage(record, block_id, primary_image_masked))
 			ereport(ERROR,
 					(errcode(ERRCODE_INTERNAL_ERROR),
-					 errmsg_internal("%s", record->errormsg_buf)));
+					 errmsg_internal("%s", record->errordata.message)));
 
 		/*
 		 * If masking function is defined, mask both the primary and replay
@@ -3062,9 +3062,9 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 	for (;;)
 	{
-		char	   *errormsg;
+		XLogReaderError errordata = {0};
 
-		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
+		record = XLogPrefetcherReadRecord(xlogprefetcher, &errordata);
 		if (record == NULL)
 		{
 			/*
@@ -3098,9 +3098,9 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * StandbyMode that only happens if we have been triggered, so we
 			 * shouldn't loop anymore in that case.
 			 */
-			if (errormsg)
+			if (errordata.message)
 				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+						(errmsg_internal("%s", errordata.message) /* already translated */ ));
 		}
 
 		/*
@@ -3385,9 +3385,9 @@ retry:
 		 * Emit this error right now then retry this page immediately. Use
 		 * errmsg_internal() because the message was already translated.
 		 */
-		if (xlogreader->errormsg_buf[0])
+		if (xlogreader->errordata.message[0])
 			ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-					(errmsg_internal("%s", xlogreader->errormsg_buf)));
+					(errmsg_internal("%s", xlogreader->errordata.message)));
 
 		/* reset any error XLogReaderValidatePageHeader() might have set */
 		XLogReaderResetError(xlogreader);
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index e174a2a891..5c64454e7e 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -395,7 +395,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 		if (!RestoreBlockImage(record, block_id, page))
 			ereport(ERROR,
 					(errcode(ERRCODE_INTERNAL_ERROR),
-					 errmsg_internal("%s", record->errormsg_buf)));
+					 errmsg_internal("%s", record->errordata.message)));
 
 		/*
 		 * The page may be uninitialized. If so, we can't set the LSN because
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 41243d0187..f48feab944 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -641,12 +641,13 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 	for (;;)
 	{
 		XLogRecord *record;
-		char	   *err = NULL;
+		XLogReaderError errordata = {0};
 
 		/* the read_page callback waits for new WAL */
-		record = XLogReadRecord(ctx->reader, &err);
-		if (err)
-			elog(ERROR, "could not find logical decoding starting point: %s", err);
+		record = XLogReadRecord(ctx->reader, &errordata);
+		if (errordata.message)
+			elog(ERROR, "could not find logical decoding starting point: %s",
+				 errordata.message);
 		if (!record)
 			elog(ERROR, "could not find logical decoding starting point");
 
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 55a24c02c9..e7f74809e3 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -244,11 +244,12 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		while (ctx->reader->EndRecPtr < end_of_wal)
 		{
 			XLogRecord *record;
-			char	   *errm = NULL;
+			XLogReaderError errordata = {0};
 
-			record = XLogReadRecord(ctx->reader, &errm);
-			if (errm)
-				elog(ERROR, "could not find record for logical decoding: %s", errm);
+			record = XLogReadRecord(ctx->reader, &errordata);
+			if (errordata.message)
+				elog(ERROR, "could not find record for logical decoding: %s",
+					 errordata.message);
 
 			/*
 			 * The {begin_txn,change,commit_txn}_wrapper callbacks above will
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6035cf4816..4fa4e6bfed 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -503,17 +503,17 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 		/* Decode at least one record, until we run out of records */
 		while (ctx->reader->EndRecPtr < moveto)
 		{
-			char	   *errm = NULL;
 			XLogRecord *record;
+			XLogReaderError errordata = {0};
 
 			/*
 			 * Read records.  No changes are generated in fast_forward mode,
 			 * but snapbuilder/slot statuses are updated properly.
 			 */
-			record = XLogReadRecord(ctx->reader, &errm);
-			if (errm)
+			record = XLogReadRecord(ctx->reader, &errordata);
+			if (errordata.message)
 				elog(ERROR, "could not find record while advancing replication slot: %s",
-					 errm);
+					 errordata.message);
 
 			/*
 			 * Process the record.  Storage-level changes are ignored in
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d27ef2985d..d05c60f09f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3045,7 +3045,7 @@ static void
 XLogSendLogical(void)
 {
 	XLogRecord *record;
-	char	   *errm;
+	XLogReaderError errordata = {0};
 
 	/*
 	 * We'll use the current flush point to determine whether we've caught up.
@@ -3063,12 +3063,12 @@ XLogSendLogical(void)
 	 */
 	WalSndCaughtUp = false;
 
-	record = XLogReadRecord(logical_decoding_ctx->reader, &errm);
+	record = XLogReadRecord(logical_decoding_ctx->reader, &errordata);
 
 	/* xlog record was invalid */
-	if (errm != NULL)
+	if (errordata.message != NULL)
 		elog(ERROR, "could not find record while sending logically-decoded data: %s",
-			 errm);
+			 errordata.message);
 
 	if (record != NULL)
 	{
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 27782237d0..2705d9bf45 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -68,7 +68,7 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 	XLogPageReadPrivate private;
 
 	private.tliIndex = tliIndex;
@@ -82,16 +82,16 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
 	XLogBeginRead(xlogreader, startpoint);
 	do
 	{
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogReadRecord(xlogreader, &errordata);
 
 		if (record == NULL)
 		{
 			XLogRecPtr	errptr = xlogreader->EndRecPtr;
 
-			if (errormsg)
+			if (errordata.message)
 				pg_fatal("could not read WAL record at %X/%X: %s",
 						 LSN_FORMAT_ARGS(errptr),
-						 errormsg);
+						 errordata.message);
 			else
 				pg_fatal("could not read WAL record at %X/%X",
 						 LSN_FORMAT_ARGS(errptr));
@@ -126,7 +126,7 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex,
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 	XLogPageReadPrivate private;
 	XLogRecPtr	endptr;
 
@@ -139,12 +139,12 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex,
 		pg_fatal("out of memory while allocating a WAL reading processor");
 
 	XLogBeginRead(xlogreader, ptr);
-	record = XLogReadRecord(xlogreader, &errormsg);
+	record = XLogReadRecord(xlogreader, &errordata);
 	if (record == NULL)
 	{
-		if (errormsg)
+		if (errordata.message)
 			pg_fatal("could not read WAL record at %X/%X: %s",
-					 LSN_FORMAT_ARGS(ptr), errormsg);
+					 LSN_FORMAT_ARGS(ptr), errordata.message);
 		else
 			pg_fatal("could not read WAL record at %X/%X",
 					 LSN_FORMAT_ARGS(ptr));
@@ -173,7 +173,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 	XLogRecord *record;
 	XLogRecPtr	searchptr;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 	XLogPageReadPrivate private;
 
 	/*
@@ -204,14 +204,14 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 		uint8		info;
 
 		XLogBeginRead(xlogreader, searchptr);
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogReadRecord(xlogreader, &errordata);
 
 		if (record == NULL)
 		{
-			if (errormsg)
+			if (errordata.message)
 				pg_fatal("could not find previous WAL record at %X/%X: %s",
 						 LSN_FORMAT_ARGS(searchptr),
-						 errormsg);
+						 errordata.message);
 			else
 				pg_fatal("could not find previous WAL record at %X/%X",
 						 LSN_FORMAT_ARGS(searchptr));
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index e8b5a6cd61..4129ba901b 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -508,7 +508,7 @@ XLogRecordSaveFPWs(XLogReaderState *record, const char *savepath)
 
 		/* Full page exists, so let's save it */
 		if (!RestoreBlockImage(record, block_id, page))
-			pg_fatal("%s", record->errormsg_buf);
+			pg_fatal("%s", record->errordata.message);
 
 		(void) XLogRecGetBlockTagExtended(record, block_id,
 										  &rnode, &fork, &blk, NULL);
@@ -796,7 +796,7 @@ main(int argc, char **argv)
 	XLogRecord *record;
 	XLogRecPtr	first_record;
 	char	   *waldir = NULL;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
 	static struct option long_options[] = {
 		{"bkp-details", no_argument, NULL, 'b'},
@@ -1239,7 +1239,7 @@ main(int argc, char **argv)
 		}
 
 		/* try to read the next record */
-		record = XLogReadRecord(xlogreader_state, &errormsg);
+		record = XLogReadRecord(xlogreader_state, &errordata);
 		if (!record)
 		{
 			if (!config.follow || private.endptr_reached)
@@ -1304,10 +1304,10 @@ main(int argc, char **argv)
 	if (time_to_stop)
 		exit(0);
 
-	if (errormsg)
+	if (errordata.message)
 		pg_fatal("error in WAL record at %X/%X: %s",
 				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+				 errordata.message);
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index 796a74f322..e7d30554ed 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -146,9 +146,9 @@ static XLogRecord *
 ReadNextXLogRecord(XLogReaderState *xlogreader)
 {
 	XLogRecord *record;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
-	record = XLogReadRecord(xlogreader, &errormsg);
+	record = XLogReadRecord(xlogreader, &errordata);
 
 	if (record == NULL)
 	{
@@ -161,11 +161,12 @@ ReadNextXLogRecord(XLogReaderState *xlogreader)
 		if (private_data->end_of_wal)
 			return NULL;
 
-		if (errormsg)
+		if (errordata.message)
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read WAL at %X/%X: %s",
-							LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+							LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+							errordata.message)));
 		else
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -384,7 +385,7 @@ GetWALBlockInfo(FunctionCallInfo fcinfo, XLogReaderState *record,
 			if (!RestoreBlockImage(record, block_id, page))
 				ereport(ERROR,
 						(errcode(ERRCODE_INTERNAL_ERROR),
-						 errmsg_internal("%s", record->errormsg_buf)));
+						 errmsg_internal("%s", record->errordata.message)));
 
 			block_fpi_data = (bytea *) palloc(BLCKSZ + VARHDRSZ);
 			SET_VARSIZE(block_fpi_data, BLCKSZ + VARHDRSZ);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 66823bc2a7..53ce72c4c2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3077,6 +3077,8 @@ XLogPageReadResult
 XLogPrefetchStats
 XLogPrefetcher
 XLogPrefetcherFilter
+XLogReaderError
+XLogReaderErrorCode
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.40.1

v2-0002-Make-WAL-replay-more-robust-on-OOM-failures.patch (text/x-diff; charset=us-ascii)
From e374c0ea3a9e775f44ec63582895e4596c65da1c Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 9 Aug 2023 14:53:26 +0900
Subject: [PATCH v2 2/3] Make WAL replay more robust on OOM failures

This takes advantage of the new error facility for WAL readers, allowing
WAL replay to loop when an out-of-memory failure happens while reading a
record.  Such failures were the origin of potential data loss scenarios;
this makes WAL replay more robust by acting like a standby here.
---
 src/backend/access/transam/xlogrecovery.c | 75 ++++++++++++++++-------
 1 file changed, 52 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 68100bfa4a..d6811c8678 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3067,29 +3067,50 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errordata);
 		if (record == NULL)
 		{
-			/*
-			 * When we find that WAL ends in an incomplete record, keep track
-			 * of that record.  After recovery is done, we'll write a record
-			 * to indicate to downstream WAL readers that that portion is to
-			 * be ignored.
-			 *
-			 * However, when ArchiveRecoveryRequested = true, we're going to
-			 * switch to a new timeline at the end of recovery. We will only
-			 * copy WAL over to the new timeline up to the end of the last
-			 * complete record, so if we did this, we would later create an
-			 * overwrite contrecord in the wrong place, breaking everything.
-			 */
-			if (!ArchiveRecoveryRequested &&
-				!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
+			switch (errordata.code)
 			{
-				abortedRecPtr = xlogreader->abortedRecPtr;
-				missingContrecPtr = xlogreader->missingContrecPtr;
-			}
+				case XLOG_READER_NONE:
+					/* Possible when XLogPageRead() has failed */
+					Assert(!errordata.message);
+					/* FALLTHROUGH */
 
-			if (readFile >= 0)
-			{
-				close(readFile);
-				readFile = -1;
+				case XLOG_READER_INVALID_DATA:
+
+					/*
+					 * When we find that WAL ends in an incomplete record,
+					 * keep track of that record.  After recovery is done,
+					 * we'll write a record to indicate to downstream WAL
+					 * readers that that portion is to be ignored.
+					 *
+					 * However, when ArchiveRecoveryRequested = true, we're
+					 * going to switch to a new timeline at the end of
+					 * recovery. We will only copy WAL over to the new
+					 * timeline up to the end of the last complete record, so
+					 * if we did this, we would later create an overwrite
+					 * contrecord in the wrong place, breaking everything.
+					 */
+					if (!ArchiveRecoveryRequested &&
+						!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
+					{
+						abortedRecPtr = xlogreader->abortedRecPtr;
+						missingContrecPtr = xlogreader->missingContrecPtr;
+					}
+
+					if (readFile >= 0)
+					{
+						close(readFile);
+						readFile = -1;
+					}
+					break;
+				case XLOG_READER_OOM:
+
+					/*
+					 * If we failed because of an out-of-memory problem, just
+					 * give up and retry recovery later.  It may be possible
+					 * that the WAL record to decode required a larger memory
+					 * allocation than what the host can offer.
+					 */
+					break;
 			}
 
 			/*
@@ -3147,9 +3168,12 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * WAL from the archive, even if pg_wal is completely empty, but
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
+			 *
+			 * It may be possible that the record was not decoded because of
+			 * an out-of-memory failure.  In this case, just loop.
 			 */
 			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+				!fetching_ckpt && errordata.code != XLOG_READER_OOM)
 			{
 				ereport(DEBUG1,
 						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
@@ -3173,9 +3197,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
+			/*
+			 * In standby mode or if the WAL record failed on an out-of-memory,
+			 * loop back to retry.  Otherwise, give up.
+			 */
 			if (StandbyMode && !CheckForStandbyTrigger())
 				continue;
+			else if (errordata.code == XLOG_READER_OOM)
+				continue;
 			else
 				return NULL;
 		}
-- 
2.40.1

v2-0003-Tweak-to-force-OOM-behavior-when-replaying-record.patch (text/x-diff; charset=us-ascii)
From 8c834db3da2b5391cebeff7404a8cc8ca15b58b2 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 9 Aug 2023 14:53:44 +0900
Subject: [PATCH v2 3/3] Tweak to force OOM behavior when replaying records

---
 src/backend/access/transam/xlogreader.c | 26 ++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 891e38e6e7..c5f7985d88 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -557,6 +557,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	int			readOff;
 	DecodedXLogRecord *decoded;
 	XLogReaderError errordata = {0};		/* not used */
+	bool		trigger_oom = false;
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -708,7 +709,30 @@ restart:
 								  total_len,
 								  !nonblocking /* allow_oversized */ );
 
-	if (decoded == NULL)
+#ifndef FRONTEND
+
+	/*
+	 * Trick to emulate an OOM after a hardcoded number of records replayed.
+	 */
+	{
+		struct stat fstat;
+		static int	counter = 0;
+
+		if (stat("/tmp/xlogreader_oom", &fstat) == 0)
+		{
+			counter++;
+			if (counter >= 100)
+			{
+				trigger_oom = true;
+
+				/* Reset counter, to not fail when shutting down WAL */
+				counter = 0;
+			}
+		}
+	}
+#endif
+
+	if (decoded == NULL || trigger_oom)
 	{
 		/*
 		 * There is no space in the decode buffer.  The caller should help
-- 
2.40.1

#13Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#12)
Re: Incorrect handling of OOM in WAL replay leading to data loss

At Wed, 9 Aug 2023 15:03:21 +0900, Michael Paquier <michael@paquier.xyz> wrote in

I have spent more time on 0001, polishing it and fixing a few bugs
that I have found while reviewing the whole.  Most of them were
related to mistakes in resetting the error state when expected.  I
have also expanded DecodeXLogRecord() to use an error structure
instead of only an errmsg, giving more consistency.  The error state
now relies on two structures:
+typedef enum XLogReaderErrorCode
+{
+   XLOG_READER_NONE = 0,
+   XLOG_READER_OOM,            /* out-of-memory */
+   XLOG_READER_INVALID_DATA,   /* record data */
+} XLogReaderErrorCode;
+typedef struct XLogReaderError
+{
+   /* Buffer to hold error message */
+   char       *message;
+   bool        message_deferred;
+   /* Error code when filling *message */
+   XLogReaderErrorCode code;
+} XLogReaderError;

I'm kind of happy with this layer, now.

I'm not certain if message_deferred is a property of the error
struct. Callers don't seem to need that information.
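To illustrate the point about what callers actually consume, here is a minimal standalone sketch, not code from the patch. It copies the two types quoted above without message_deferred, and the helpers reader_error_name() and describe_reader_error() are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Standalone copies of the types quoted above, minus message_deferred. */
typedef enum XLogReaderErrorCode
{
	XLOG_READER_NONE = 0,
	XLOG_READER_OOM,			/* out-of-memory */
	XLOG_READER_INVALID_DATA	/* record data */
} XLogReaderErrorCode;

typedef struct XLogReaderError
{
	char	   *message;		/* NULL when no message was produced */
	XLogReaderErrorCode code;
} XLogReaderError;

/* Hypothetical helper: short label for an error code. */
const char *
reader_error_name(XLogReaderErrorCode code)
{
	switch (code)
	{
		case XLOG_READER_NONE:
			return "no error";
		case XLOG_READER_OOM:
			return "out of memory";
		case XLOG_READER_INVALID_DATA:
			return "invalid data";
	}
	return "unknown";
}

/*
 * Hypothetical helper: format a one-line report into buf, using only the
 * two fields a caller sees.  Returns true if the report fit into buf.
 */
bool
describe_reader_error(const XLogReaderError *err, char *buf, size_t buflen)
{
	int			needed;

	assert(err != NULL && buf != NULL && buflen > 0);

	needed = snprintf(buf, buflen, "%s: %s",
					  reader_error_name(err->code),
					  err->message ? err->message : "no error message");
	return needed >= 0 && (size_t) needed < buflen;
}
```

In this sketch nothing on the caller side touches a deferred flag, which is the observation made above: that state only matters inside the reader.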

The name "XLOG_READER_NONE" seems too generic. XLOG_READER_NOERROR will
be clearer.

I have also spent some time on finding a more elegant solution for the
WAL replay, relying on the new facility from 0001. And it happens
that it is easy enough to loop if facing an out-of-memory failure when
reading a record when we are in crash recovery, as the state is
actually close to what a standby does. The trick is that we should
not change the state and avoid tracking a continuation record. This
is done in 0002, making replay more robust. With the addition of the

0002 shifts the behavior for the OOM case from ending recovery to
retrying at the same record. If the last record is really corrupted,
the server won't be able to finish recovery. I doubt we are good with
this behavior change.

error injection tweak in 0003, I am able to finish recovery while the
startup process loops if under memory pressure. As mentioned
previously, there are more code paths to consider, but that's a start
to fix the data loss problems.

(The file name "xlogreader_oom" is a bit trickier to type than "hoge"
or "foo"X( )

Comments are welcome.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#14Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#13)
Re: Incorrect handling of OOM in WAL replay leading to data loss

On Wed, Aug 09, 2023 at 04:13:53PM +0900, Kyotaro Horiguchi wrote:

I'm not certain if message_deferred is a property of the error
struct. Callers don't seem to need that information.

True enough, will remove.

The name "XLOG_READER_NONE" seems too generic. XLOG_READER_NOERROR will
be clearer.

Or perhaps just XLOG_READER_NO_ERROR?

0002 shifts the behavior for the OOM case from ending recovery to
retrying at the same record. If the last record is really corrupted,
the server won't be able to finish recovery. I doubt we are good with
this behavior change.

You mean on an incorrect xl_tot_len? Yes that could be possible.
Another possibility would be a retry logic with a hardcoded number of
attempts and a delay between each. Once the infrastructure is in
place, this still deserves more discussion but we can be flexible.
An immediate FATAL is also a choice.
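
The retry idea above can be sketched in isolation as follows. This is only an illustration under stated assumptions, not code from any of the patches: ReadStatus, read_record_with_retry(), the wait hook and simulated_read() are all hypothetical names, and the real loop in ReadRecord() has far more state to care about.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of one read attempt. */
typedef enum ReadStatus
{
	READ_OK = 0,
	READ_OOM,
	READ_INVALID_DATA
} ReadStatus;

/* Hypothetical callbacks: one read attempt, and a delay between attempts. */
typedef ReadStatus (*read_record_fn) (void *arg);
typedef void (*wait_fn) (int attempt, void *arg);

/*
 * Try to read a record up to max_attempts times.  Only an OOM outcome is
 * retried; success or invalid data is returned right away.  If every
 * attempt fails on OOM, READ_OOM is returned and the caller decides what
 * to do (for example, a FATAL).  *attempts_used, if not NULL, receives
 * the number of attempts made.
 */
ReadStatus
read_record_with_retry(read_record_fn read_record, wait_fn wait_before_retry,
					   void *arg, int max_attempts, int *attempts_used)
{
	ReadStatus	status = READ_OOM;
	int			attempt;

	assert(read_record != NULL && max_attempts > 0);

	for (attempt = 1; attempt <= max_attempts; attempt++)
	{
		status = read_record(arg);
		if (attempts_used != NULL)
			*attempts_used = attempt;
		if (status != READ_OOM)
			break;

		/* Give the host some time to calm down before the next attempt. */
		if (wait_before_retry != NULL && attempt < max_attempts)
			wait_before_retry(attempt, arg);
	}

	return status;
}

/*
 * Simulated reader for demonstration: arg points to an int counting how
 * many more times the read should fail with OOM before succeeding.
 */
ReadStatus
simulated_read(void *arg)
{
	int		   *remaining_failures = (int *) arg;

	if (*remaining_failures > 0)
	{
		(*remaining_failures)--;
		return READ_OOM;
	}
	return READ_OK;
}
```

A truly corrupted, oversized xl_tot_len would exhaust the attempts and fall back to the caller's final decision, which is why the number of attempts has to be bounded in this variant.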
--
Michael

#15Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#14)
Re: Incorrect handling of OOM in WAL replay leading to data loss

At Wed, 9 Aug 2023 16:35:09 +0900, Michael Paquier <michael@paquier.xyz> wrote in

Or perhaps just XLOG_READER_NO_ERROR?

Looks fine.

0002 shifts the behavior for the OOM case from ending recovery to
retrying at the same record. If the last record is really corrupted,
the server won't be able to finish recovery. I doubt we are good with
this behavior change.

You mean on an incorrect xl_tot_len? Yes that could be possible.
Another possibility would be a retry logic with a hardcoded number of
attempts and a delay between each. Once the infrastructure is in
place, this still deserves more discussion but we can be flexible.
An immediate FATAL is also a choice.

While it's a kind of bug in total, we encountered a case where an
excessively large xl_tot_len actually came from a corrupted
record. [1]

I'm glad to see this infrastructure comes in, and I'm on board with
retrying due to an OOM. However, I think we really need official steps
to wrap up recovery when there is a truly broken, oversized
xl_tot_len.

[1]: /messages/by-id/17928-aa92416a70ff44a2@postgresql.org

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#16Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#15)
3 attachment(s)
Re: Incorrect handling of OOM in WAL replay leading to data loss

On Wed, Aug 09, 2023 at 05:00:49PM +0900, Kyotaro Horiguchi wrote:

Looks fine.

Okay, I've updated the patch accordingly. I'll look at 0001 again
at the beginning of next week.

While it's a kind of bug in total, we encountered a case where an
excessively large xl_tot_len actually came from a corrupted
record. [1]

Right, I remember this one. I think that Thomas was pretty much right
that this could be caused by a lack of zeroing in the WAL
pages.

I'm glad to see this infrastructure comes in, and I'm on board with
retrying due to an OOM. However, I think we really need official steps
to wrap up recovery when there is a truly broken, oversized
xl_tot_len.

There are a few options on the table, only doable once the WAL reader
provides the error state to the startup process:
1) Retry a few times and FATAL.
2) Just FATAL immediately and don't wait.
3) Retry and hope for the best that the host calms down.
I have not seen this being much of an issue in the field, so
perhaps option 2 with the structure of 0002 and a FATAL when we catch
XLOG_READER_OOM in the switch would be enough. At least that's enough
for the cases we've seen. I'll think a bit more about it, as well.
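
For discussion, the three options can be written down as a small decision function. This is a hedged sketch, not part of the patch set: it reuses the error code names from 0001, while OomPolicy, RecoveryAction and decide_on_read_failure() are hypothetical names, and "end of WAL" here is a simplification of what the startup process does outside standby mode.

```c
#include <assert.h>

/* Error codes as named in 0001 (standalone copy for this sketch). */
typedef enum XLogReaderErrorCode
{
	XLOG_READER_NO_ERROR = 0,
	XLOG_READER_OOM,			/* out-of-memory */
	XLOG_READER_INVALID_DATA	/* record data */
} XLogReaderErrorCode;

/* Hypothetical: the three options listed above. */
typedef enum OomPolicy
{
	OOM_RETRY_THEN_FATAL,		/* 1) retry a few times, then FATAL */
	OOM_FATAL_IMMEDIATELY,		/* 2) FATAL immediately */
	OOM_RETRY_FOREVER			/* 3) retry until the host calms down */
} OomPolicy;

/* Hypothetical: what the caller should do after a failed read. */
typedef enum RecoveryAction
{
	ACTION_END_OF_WAL,
	ACTION_RETRY,
	ACTION_FATAL
} RecoveryAction;

/*
 * Decide what to do after a record could not be read.  oom_attempts is
 * the number of attempts already made that failed on OOM; it is only
 * consulted for OOM_RETRY_THEN_FATAL.
 */
RecoveryAction
decide_on_read_failure(XLogReaderErrorCode code, OomPolicy policy,
					   int oom_attempts, int max_oom_attempts)
{
	assert(oom_attempts >= 0 && max_oom_attempts >= 0);

	/* Anything but an OOM keeps being treated as the end of WAL. */
	if (code != XLOG_READER_OOM)
		return ACTION_END_OF_WAL;

	switch (policy)
	{
		case OOM_RETRY_THEN_FATAL:
			return (oom_attempts < max_oom_attempts) ?
				ACTION_RETRY : ACTION_FATAL;
		case OOM_FATAL_IMMEDIATELY:
			return ACTION_FATAL;
		case OOM_RETRY_FOREVER:
			return ACTION_RETRY;
	}

	/* Unknown policy: fail rather than silently end recovery. */
	return ACTION_FATAL;
}
```

Whatever policy is picked, the point shared by all three is that an OOM never maps to ACTION_END_OF_WAL, which is the data loss case this thread started from.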

Yeah, agreed. That's orthogonal to the issue reported by Ethan,
unfortunately, where he was able to trigger the issue of this thread
by manipulating the sizing of a host after producing a record larger
than what the host could afford after the resizing :/
--
Michael

Attachments:

v3-0003-Tweak-to-force-OOM-behavior-when-replaying-record.patch (text/x-diff; charset=us-ascii)
From 62406086b3d2046a3ce1d7d84d51e6ce4721b885 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 9 Aug 2023 14:53:44 +0900
Subject: [PATCH v3 3/3] Tweak to force OOM behavior when replaying records

---
 src/backend/access/transam/xlogreader.c | 26 ++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index c29b8ff387..ed43360f78 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -557,6 +557,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	int			readOff;
 	DecodedXLogRecord *decoded;
 	XLogReaderError errordata = {0};		/* not used */
+	bool		trigger_oom = false;
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -708,7 +709,30 @@ restart:
 								  total_len,
 								  !nonblocking /* allow_oversized */ );
 
-	if (decoded == NULL)
+#ifndef FRONTEND
+
+	/*
+	 * Trick to emulate an OOM after a hardcoded number of records replayed.
+	 */
+	{
+		struct stat fstat;
+		static int	counter = 0;
+
+		if (stat("/tmp/xlogreader_oom", &fstat) == 0)
+		{
+			counter++;
+			if (counter >= 100)
+			{
+				trigger_oom = true;
+
+				/* Reset counter, to not fail when shutting down WAL */
+				counter = 0;
+			}
+		}
+	}
+#endif
+
+	if (decoded == NULL || trigger_oom)
 	{
 		/*
 		 * There is no space in the decode buffer.  The caller should help
-- 
2.40.1

v3-0001-Add-infrastructure-to-report-error-codes-in-WAL-r.patch (text/x-diff; charset=us-ascii)
From 1274883c4d06dec8876ec60ad983749b7fae3946 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 9 Aug 2023 17:39:52 +0900
Subject: [PATCH v3 1/3] Add infrastructure to report error codes in WAL reader

This commit moves the error state coming from WAL readers into a new
structure that includes the existing pointer to the error message
buffer, and also gains an error code that is fed back to the callers of
the following routines:
XLogPrefetcherReadRecord()
XLogReadRecord()
XLogNextRecord()
DecodeXLogRecord()

This will help in improving the decisions to take during recovery
depending on the failure mode reported.
---
 src/include/access/xlogprefetcher.h           |   2 +-
 src/include/access/xlogreader.h               |  33 +++-
 src/backend/access/transam/twophase.c         |   8 +-
 src/backend/access/transam/xlog.c             |   6 +-
 src/backend/access/transam/xlogprefetcher.c   |   4 +-
 src/backend/access/transam/xlogreader.c       | 167 ++++++++++++------
 src/backend/access/transam/xlogrecovery.c     |  14 +-
 src/backend/access/transam/xlogutils.c        |   2 +-
 src/backend/replication/logical/logical.c     |   9 +-
 .../replication/logical/logicalfuncs.c        |   9 +-
 src/backend/replication/slotfuncs.c           |   8 +-
 src/backend/replication/walsender.c           |   8 +-
 src/bin/pg_rewind/parsexlog.c                 |  24 +--
 src/bin/pg_waldump/pg_waldump.c               |  10 +-
 contrib/pg_walinspect/pg_walinspect.c         |  11 +-
 src/tools/pgindent/typedefs.list              |   2 +
 16 files changed, 200 insertions(+), 117 deletions(-)

diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
index 7dd7f20ad0..5563ad1a67 100644
--- a/src/include/access/xlogprefetcher.h
+++ b/src/include/access/xlogprefetcher.h
@@ -48,7 +48,7 @@ extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
 									XLogRecPtr recPtr);
 
 extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
-											char **errmsg);
+											XLogReaderError *errordata);
 
 extern void XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index da32c7db77..06664dc6fb 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -58,6 +58,24 @@ typedef struct WALSegmentContext
 
 typedef struct XLogReaderState XLogReaderState;
 
+/* Values for XLogReaderError.code */
+typedef enum XLogReaderErrorCode
+{
+	XLOG_READER_NO_ERROR = 0,
+	XLOG_READER_OOM,			/* out-of-memory */
+	XLOG_READER_INVALID_DATA,	/* record data */
+} XLogReaderErrorCode;
+
+/* Error status generated by a WAL reader on failure */
+typedef struct XLogReaderError
+{
+	/* Buffer to hold error message */
+	char	   *message;
+	/* Error code when filling *message */
+	XLogReaderErrorCode code;
+} XLogReaderError;
+
+
 /* Function type definitions for various xlogreader interactions */
 typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
@@ -307,9 +325,9 @@ struct XLogReaderState
 	char	   *readRecordBuf;
 	uint32		readRecordBufSize;
 
-	/* Buffer to hold error message */
-	char	   *errormsg_buf;
-	bool		errormsg_deferred;
+	/* Error state data */
+	XLogReaderError errordata;
+	bool		errordata_deferred;
 
 	/*
 	 * Flag to indicate to XLogPageReadCB that it should not block waiting for
@@ -324,7 +342,8 @@ struct XLogReaderState
 static inline bool
 XLogReaderHasQueuedRecordOrError(XLogReaderState *state)
 {
-	return (state->decode_queue_head != NULL) || state->errormsg_deferred;
+	return (state->decode_queue_head != NULL) ||
+		state->errordata_deferred;
 }
 
 /* Get a new XLogReader */
@@ -355,11 +374,11 @@ typedef enum XLogPageReadResult
 
 /* Read the next XLog record. Returns NULL on end-of-WAL or failure */
 extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
-										 char **errormsg);
+										 XLogReaderError *errordata);
 
 /* Consume the next record or error. */
 extern DecodedXLogRecord *XLogNextRecord(XLogReaderState *state,
-										 char **errormsg);
+										 XLogReaderError *errordata);
 
 /* Release the previously returned record, if necessary. */
 extern XLogRecPtr XLogReleasePreviousRecord(XLogReaderState *state);
@@ -399,7 +418,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
 							 DecodedXLogRecord *decoded,
 							 XLogRecord *record,
 							 XLogRecPtr lsn,
-							 char **errormsg);
+							 XLogReaderError *errordata);
 
 /*
  * Macros that provide access to parts of the record most recently returned by
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index c6af8cfd7e..08bd6586ec 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1399,7 +1399,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
 	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
 									XL_ROUTINE(.page_read = &read_local_xlog_page,
@@ -1413,15 +1413,15 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 
 	XLogBeginRead(xlogreader, lsn);
-	record = XLogReadRecord(xlogreader, &errormsg);
+	record = XLogReadRecord(xlogreader, &errordata);
 
 	if (record == NULL)
 	{
-		if (errormsg)
+		if (errordata.message)
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read two-phase state from WAL at %X/%X: %s",
-							LSN_FORMAT_ARGS(lsn), errormsg)));
+							LSN_FORMAT_ARGS(lsn), errordata.message)));
 		else
 			ereport(ERROR,
 					(errcode_for_file_access(),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 60c0b7ec3a..d16acd5e49 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -953,7 +953,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		DecodedXLogRecord *decoded;
 		StringInfoData buf;
 		StringInfoData recordBuf;
-		char	   *errormsg = NULL;
+		XLogReaderError	errordata = {0};
 		MemoryContext oldCxt;
 
 		oldCxt = MemoryContextSwitchTo(walDebugCxt);
@@ -987,10 +987,10 @@ XLogInsertRecord(XLogRecData *rdata,
 								   decoded,
 								   record,
 								   EndPos,
-								   &errormsg))
+								   &errordata))
 		{
 			appendStringInfo(&buf, "error decoding record: %s",
-							 errormsg ? errormsg : "no error message");
+							 errordata.message ? errordata.message : "no error message");
 		}
 		else
 		{
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 539928cb85..92d691ca49 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -984,7 +984,7 @@ XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher, XLogRecPtr recPtr)
  * tries to initiate I/O for blocks referenced in future WAL records.
  */
 XLogRecord *
-XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
+XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, XLogReaderError *errdata)
 {
 	DecodedXLogRecord *record;
 	XLogRecPtr	replayed_up_to;
@@ -1052,7 +1052,7 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	}
 
 	/* Read the next record. */
-	record = XLogNextRecord(prefetcher->reader, errmsg);
+	record = XLogNextRecord(prefetcher->reader, errdata);
 	if (!record)
 		return NULL;
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index c9f9f6e98f..c29b8ff387 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -41,8 +41,10 @@
 #include "common/logging.h"
 #endif
 
-static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
-			pg_attribute_printf(2, 3);
+static void report_invalid_record(XLogReaderState *state,
+								  XLogReaderErrorCode errorcode,
+								  const char *fmt,...)
+			pg_attribute_printf(3, 4);
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
@@ -66,21 +68,23 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 #define DEFAULT_DECODE_BUFFER_SIZE (64 * 1024)
 
 /*
- * Construct a string in state->errormsg_buf explaining what's wrong with
+ * Construct a string in state->errordata.message explaining what's wrong with
  * the current record being read.
  */
 static void
-report_invalid_record(XLogReaderState *state, const char *fmt,...)
+report_invalid_record(XLogReaderState *state, XLogReaderErrorCode errorcode,
+					  const char *fmt,...)
 {
 	va_list		args;
 
 	fmt = _(fmt);
 
 	va_start(args, fmt);
-	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
+	vsnprintf(state->errordata.message, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
 
-	state->errormsg_deferred = true;
+	state->errordata_deferred = true;
+	state->errordata.code = errorcode;
 }
 
 /*
@@ -141,15 +145,16 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* system_identifier initialized to zeroes above */
 	state->private_data = private_data;
 	/* ReadRecPtr, EndRecPtr and readLen initialized to zeroes above */
-	state->errormsg_buf = palloc_extended(MAX_ERRORMSG_LEN + 1,
-										  MCXT_ALLOC_NO_OOM);
-	if (!state->errormsg_buf)
+	state->errordata.message = palloc_extended(MAX_ERRORMSG_LEN + 1,
+											   MCXT_ALLOC_NO_OOM);
+	if (!state->errordata.message)
 	{
 		pfree(state->readBuf);
 		pfree(state);
 		return NULL;
 	}
-	state->errormsg_buf[0] = '\0';
+	state->errordata.message[0] = '\0';
+	state->errordata.code = XLOG_READER_NO_ERROR;
 
 	/*
 	 * Allocate an initial readRecordBuf of minimal size, which can later be
@@ -157,7 +162,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	 */
 	if (!allocate_recordbuf(state, 0))
 	{
-		pfree(state->errormsg_buf);
+		pfree(state->errordata.message);
 		pfree(state->readBuf);
 		pfree(state);
 		return NULL;
@@ -175,7 +180,7 @@ XLogReaderFree(XLogReaderState *state)
 	if (state->decode_buffer && state->free_decode_buffer)
 		pfree(state->decode_buffer);
 
-	pfree(state->errormsg_buf);
+	pfree(state->errordata.message);
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
 	pfree(state->readBuf);
@@ -351,23 +356,27 @@ XLogReleasePreviousRecord(XLogReaderState *state)
  *
  * On success, a record is returned.
  *
- * The returned record (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogNextRecord.
+ * The returned record (or errordata->message) points to an internal buffer
+ * that's valid until the next call to XLogNextRecord.
  */
 DecodedXLogRecord *
-XLogNextRecord(XLogReaderState *state, char **errormsg)
+XLogNextRecord(XLogReaderState *state, XLogReaderError *errordata)
 {
 	/* Release the last record returned by XLogNextRecord(). */
 	XLogReleasePreviousRecord(state);
 
 	if (state->decode_queue_head == NULL)
 	{
-		*errormsg = NULL;
-		if (state->errormsg_deferred)
+		errordata->message = NULL;
+		errordata->code = XLOG_READER_NO_ERROR;
+		if (state->errordata_deferred)
 		{
-			if (state->errormsg_buf[0] != '\0')
-				*errormsg = state->errormsg_buf;
-			state->errormsg_deferred = false;
+			if (state->errordata.message[0] != '\0')
+				errordata->message = state->errordata.message;
+			if (state->errordata.code != XLOG_READER_NO_ERROR)
+				errordata->code = state->errordata.code;
+			state->errordata_deferred = false;
+			state->errordata.code = XLOG_READER_NO_ERROR;
 		}
 
 		/*
@@ -397,7 +406,8 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
 	state->ReadRecPtr = state->record->lsn;
 	state->EndRecPtr = state->record->next_lsn;
 
-	*errormsg = NULL;
+	errordata->message = NULL;
+	errordata->code = XLOG_READER_NO_ERROR;
 
 	return state->record;
 }
@@ -409,17 +419,17 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
  * to XLogReadRecord().
  *
  * If the page_read callback fails to read the requested data, NULL is
- * returned.  The callback is expected to have reported the error; errormsg
- * is set to NULL.
+ * returned.  The callback is expected to have reported the error;
+ * errordata->message is set to NULL.
  *
  * If the reading fails for some other reason, NULL is also returned, and
- * *errormsg is set to a string with details of the failure.
+ * *errordata is set with details of the failure.
  *
- * The returned pointer (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogReadRecord.
+ * The returned pointer (or errordata->message) points to an internal
+ * buffer that's valid until the next call to XLogReadRecord.
  */
 XLogRecord *
-XLogReadRecord(XLogReaderState *state, char **errormsg)
+XLogReadRecord(XLogReaderState *state, XLogReaderError *errordata)
 {
 	DecodedXLogRecord *decoded;
 
@@ -437,7 +447,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		XLogReadAhead(state, false /* nonblocking */ );
 
 	/* Consume the head record or error. */
-	decoded = XLogNextRecord(state, errormsg);
+	decoded = XLogNextRecord(state, errordata);
 	if (decoded)
 	{
 		/*
@@ -546,7 +556,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	bool		gotheader;
 	int			readOff;
 	DecodedXLogRecord *decoded;
-	char	   *errormsg;		/* not used */
+	XLogReaderError errordata = {0};		/* not used */
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -556,7 +566,8 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	randAccess = false;
 
 	/* reset error state */
-	state->errormsg_buf[0] = '\0';
+	state->errordata.message[0] = '\0';
+	state->errordata.code = XLOG_READER_NO_ERROR;
 	decoded = NULL;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -623,7 +634,9 @@ restart:
 	}
 	else if (targetRecOff < pageHeaderSize)
 	{
-		report_invalid_record(state, "invalid record offset at %X/%X: expected at least %u, got %u",
+		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
+							  "invalid record offset at %X/%X: expected at least %u, got %u",
 							  LSN_FORMAT_ARGS(RecPtr),
 							  pageHeaderSize, targetRecOff);
 		goto err;
@@ -632,7 +645,9 @@ restart:
 	if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
 		targetRecOff == pageHeaderSize)
 	{
-		report_invalid_record(state, "contrecord is requested by %X/%X",
+		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
+							  "contrecord is requested by %X/%X",
 							  LSN_FORMAT_ARGS(RecPtr));
 		goto err;
 	}
@@ -673,6 +688,7 @@ restart:
 		if (total_len < SizeOfXLogRecord)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "invalid record length at %X/%X: expected at least %u, got %u",
 								  LSN_FORMAT_ARGS(RecPtr),
 								  (uint32) SizeOfXLogRecord, total_len);
@@ -691,6 +707,7 @@ restart:
 	decoded = XLogReadRecordAlloc(state,
 								  total_len,
 								  !nonblocking /* allow_oversized */ );
+
 	if (decoded == NULL)
 	{
 		/*
@@ -702,6 +719,7 @@ restart:
 
 		/* We failed to allocate memory for an oversized record. */
 		report_invalid_record(state,
+							  XLOG_READER_OOM,
 							  "out of memory while trying to decode a record of length %u", total_len);
 		goto err;
 	}
@@ -724,7 +742,9 @@ restart:
 			!allocate_recordbuf(state, total_len))
 		{
 			/* We treat this as a "bogus data" condition */
-			report_invalid_record(state, "record length %u at %X/%X too long",
+			report_invalid_record(state,
+								  XLOG_READER_OOM,
+								  "record length %u at %X/%X too long",
 								  total_len, LSN_FORMAT_ARGS(RecPtr));
 			goto err;
 		}
@@ -773,6 +793,7 @@ restart:
 			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "there is no contrecord flag at %X/%X",
 									  LSN_FORMAT_ARGS(RecPtr));
 				goto err;
@@ -786,6 +807,7 @@ restart:
 				total_len != (pageHeader->xlp_rem_len + gotlen))
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "invalid contrecord length %u (expected %lld) at %X/%X",
 									  pageHeader->xlp_rem_len,
 									  ((long long) total_len) - gotlen,
@@ -867,7 +889,7 @@ restart:
 		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
 	}
 
-	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
+	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errordata))
 	{
 		/* Record the location of the next record. */
 		decoded->next_lsn = state->NextRecPtr;
@@ -918,7 +940,7 @@ err:
 		 * queued so that XLogPrefetcherReadRecord() doesn't bring us back a
 		 * second time and clobber the above state.
 		 */
-		state->errormsg_deferred = true;
+		state->errordata_deferred = true;
 	}
 
 	if (decoded && decoded->oversized)
@@ -931,9 +953,9 @@ err:
 	XLogReaderInvalReadState(state);
 
 	/*
-	 * If an error was written to errmsg_buf, it'll be returned to the caller
-	 * of XLogReadRecord() after all successfully decoded records from the
-	 * read queue.
+	 * If an error was written to errordata.message, it'll be returned to the
+	 * caller of XLogReadRecord() after all successfully decoded records from
+	 * the read queue.
 	 */
 
 	return XLREAD_FAIL;
@@ -952,7 +974,7 @@ XLogReadAhead(XLogReaderState *state, bool nonblocking)
 {
 	XLogPageReadResult result;
 
-	if (state->errormsg_deferred)
+	if (state->errordata_deferred)
 		return NULL;
 
 	result = XLogDecodeNextRecord(state, nonblocking);
@@ -970,8 +992,8 @@ XLogReadAhead(XLogReaderState *state, bool nonblocking)
  * via the page_read() callback.
  *
  * Returns XLREAD_FAIL if the required page cannot be read for some
- * reason; errormsg_buf is set in that case (unless the error occurs in the
- * page_read callback).
+ * reason; errordata.message is set in that case (unless the error occurs in
+ * the page_read callback).
  *
  * Returns XLREAD_WOULDBLOCK if the requested data can't be read without
  * waiting.  This can be returned only if the installed page_read callback
@@ -1116,6 +1138,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid record length at %X/%X: expected at least %u, got %u",
 							  LSN_FORMAT_ARGS(RecPtr),
 							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
@@ -1124,6 +1147,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid resource manager ID %u at %X/%X",
 							  record->xl_rmid, LSN_FORMAT_ARGS(RecPtr));
 		return false;
@@ -1137,6 +1161,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 		if (!(record->xl_prev < RecPtr))
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "record with incorrect prev-link %X/%X at %X/%X",
 								  LSN_FORMAT_ARGS(record->xl_prev),
 								  LSN_FORMAT_ARGS(RecPtr));
@@ -1153,6 +1178,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 		if (record->xl_prev != PrevRecPtr)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "record with incorrect prev-link %X/%X at %X/%X",
 								  LSN_FORMAT_ARGS(record->xl_prev),
 								  LSN_FORMAT_ARGS(RecPtr));
@@ -1189,6 +1215,7 @@ ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
 	if (!EQ_CRC32C(record->xl_crc, crc))
 	{
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "incorrect resource manager data checksum in record at %X/%X",
 							  LSN_FORMAT_ARGS(recptr));
 		return false;
@@ -1223,6 +1250,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid magic number %04X in WAL segment %s, LSN %X/%X, offset %u",
 							  hdr->xlp_magic,
 							  fname,
@@ -1238,6 +1266,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 							  hdr->xlp_info,
 							  fname,
@@ -1254,6 +1283,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 			longhdr->xlp_sysid != state->system_identifier)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "WAL file is from different database system: WAL file database system identifier is %llu, pg_control database system identifier is %llu",
 								  (unsigned long long) longhdr->xlp_sysid,
 								  (unsigned long long) state->system_identifier);
@@ -1262,12 +1292,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		else if (longhdr->xlp_seg_size != state->segcxt.ws_segsize)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "WAL file is from different database system: incorrect segment size in page header");
 			return false;
 		}
 		else if (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "WAL file is from different database system: incorrect XLOG_BLCKSZ in page header");
 			return false;
 		}
@@ -1280,6 +1312,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 		/* hmm, first page of file doesn't have a long header? */
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 							  hdr->xlp_info,
 							  fname,
@@ -1300,6 +1333,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "unexpected pageaddr %X/%X in WAL segment %s, LSN %X/%X, offset %u",
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
@@ -1326,6 +1360,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 			XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "out-of-sequence timeline ID %u (after %u) in WAL segment %s, LSN %X/%X, offset %u",
 								  hdr->xlp_tli,
 								  state->latestPageTLI,
@@ -1347,8 +1382,9 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 void
 XLogReaderResetError(XLogReaderState *state)
 {
-	state->errormsg_buf[0] = '\0';
-	state->errormsg_deferred = false;
+	state->errordata.message[0] = '\0';
+	state->errordata_deferred = false;
+	state->errordata.code = XLOG_READER_NO_ERROR;
 }
 
 /*
@@ -1368,7 +1404,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	XLogRecPtr	tmpRecPtr;
 	XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
 
@@ -1453,7 +1489,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	 * or we just jumped over the remaining data of a continuation.
 	 */
 	XLogBeginRead(state, tmpRecPtr);
-	while (XLogReadRecord(state, &errormsg) != NULL)
+	while (XLogReadRecord(state, &errordata) != NULL)
 	{
 		/* past the record we've found, break out */
 		if (RecPtr <= state->ReadRecPtr)
@@ -1598,8 +1634,9 @@ ResetDecoder(XLogReaderState *state)
 	state->decode_buffer_head = state->decode_buffer;
 
 	/* Clear error state. */
-	state->errormsg_buf[0] = '\0';
-	state->errormsg_deferred = false;
+	state->errordata.message[0] = '\0';
+	state->errordata_deferred = false;
+	state->errordata.code = XLOG_READER_NO_ERROR;
 }
 
 /*
@@ -1649,7 +1686,7 @@ DecodeXLogRecord(XLogReaderState *state,
 				 DecodedXLogRecord *decoded,
 				 XLogRecord *record,
 				 XLogRecPtr lsn,
-				 char **errormsg)
+				 XLogReaderError *errordata)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1732,6 +1769,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "out-of-order block_id %u at %X/%X",
 									  block_id,
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1756,6 +1794,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			if (blk->has_data && blk->data_len == 0)
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
 				goto err;
@@ -1763,6 +1802,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			if (!blk->has_data && blk->data_len != 0)
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "BKPBLOCK_HAS_DATA not set, but data length is %u at %X/%X",
 									  (unsigned int) blk->data_len,
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1799,6 +1839,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					 blk->bimg_len == BLCKSZ))
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPIMAGE_HAS_HOLE set, but hole offset %u length %u block image length %u at %X/%X",
 										  (unsigned int) blk->hole_offset,
 										  (unsigned int) blk->hole_length,
@@ -1815,6 +1856,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					(blk->hole_offset != 0 || blk->hole_length != 0))
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPIMAGE_HAS_HOLE not set, but hole offset %u length %u at %X/%X",
 										  (unsigned int) blk->hole_offset,
 										  (unsigned int) blk->hole_length,
@@ -1829,6 +1871,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					blk->bimg_len == BLCKSZ)
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPIMAGE_COMPRESSED set, but block image length %u at %X/%X",
 										  (unsigned int) blk->bimg_len,
 										  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1844,6 +1887,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					blk->bimg_len != BLCKSZ)
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "neither BKPIMAGE_HAS_HOLE nor BKPIMAGE_COMPRESSED set, but block image length is %u at %X/%X",
 										  (unsigned int) blk->data_len,
 										  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1860,6 +1904,7 @@ DecodeXLogRecord(XLogReaderState *state,
 				if (rlocator == NULL)
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPBLOCK_SAME_REL set but no previous rel at %X/%X",
 										  LSN_FORMAT_ARGS(state->ReadRecPtr));
 					goto err;
@@ -1872,6 +1917,7 @@ DecodeXLogRecord(XLogReaderState *state,
 		else
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "invalid block_id %u at %X/%X",
 								  block_id, LSN_FORMAT_ARGS(state->ReadRecPtr));
 			goto err;
@@ -1939,10 +1985,12 @@ DecodeXLogRecord(XLogReaderState *state,
 
 shortdata_err:
 	report_invalid_record(state,
+						  XLOG_READER_INVALID_DATA,
 						  "record with invalid length at %X/%X",
 						  LSN_FORMAT_ARGS(state->ReadRecPtr));
 err:
-	*errormsg = state->errormsg_buf;
+	errordata->message = state->errordata.message;
+	errordata->code = state->errordata.code;
 
 	return false;
 }
@@ -2049,6 +2097,7 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 		!record->record->blocks[block_id].in_use)
 	{
 		report_invalid_record(record,
+							  XLOG_READER_INVALID_DATA,
 							  "could not restore image at %X/%X with invalid block %d specified",
 							  LSN_FORMAT_ARGS(record->ReadRecPtr),
 							  block_id);
@@ -2056,7 +2105,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	}
 	if (!record->record->blocks[block_id].has_image)
 	{
-		report_invalid_record(record, "could not restore image at %X/%X with invalid state, block %d",
+		report_invalid_record(record,
+							  XLOG_READER_INVALID_DATA,
+							  "could not restore image at %X/%X with invalid state, block %d",
 							  LSN_FORMAT_ARGS(record->ReadRecPtr),
 							  block_id);
 		return false;
@@ -2083,7 +2134,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 									bkpb->bimg_len, BLCKSZ - bkpb->hole_length) <= 0)
 				decomp_success = false;
 #else
-			report_invalid_record(record, "could not restore image at %X/%X compressed with %s not supported by build, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not restore image at %X/%X compressed with %s not supported by build, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  "LZ4",
 								  block_id);
@@ -2100,7 +2153,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 			if (ZSTD_isError(decomp_result))
 				decomp_success = false;
 #else
-			report_invalid_record(record, "could not restore image at %X/%X compressed with %s not supported by build, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not restore image at %X/%X compressed with %s not supported by build, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  "zstd",
 								  block_id);
@@ -2109,7 +2164,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 		}
 		else
 		{
-			report_invalid_record(record, "could not restore image at %X/%X compressed with unknown method, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not restore image at %X/%X compressed with unknown method, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  block_id);
 			return false;
@@ -2117,7 +2174,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 
 		if (!decomp_success)
 		{
-			report_invalid_record(record, "could not decompress image at %X/%X, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not decompress image at %X/%X, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  block_id);
 			return false;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..68100bfa4a 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2454,7 +2454,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
 		if (!RestoreBlockImage(record, block_id, primary_image_masked))
 			ereport(ERROR,
 					(errcode(ERRCODE_INTERNAL_ERROR),
-					 errmsg_internal("%s", record->errormsg_buf)));
+					 errmsg_internal("%s", record->errordata.message)));
 
 		/*
 		 * If masking function is defined, mask both the primary and replay
@@ -3062,9 +3062,9 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 	for (;;)
 	{
-		char	   *errormsg;
+		XLogReaderError errordata = {0};
 
-		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
+		record = XLogPrefetcherReadRecord(xlogprefetcher, &errordata);
 		if (record == NULL)
 		{
 			/*
@@ -3098,9 +3098,9 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * StandbyMode that only happens if we have been triggered, so we
 			 * shouldn't loop anymore in that case.
 			 */
-			if (errormsg)
+			if (errordata.message)
 				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+						(errmsg_internal("%s", errordata.message) /* already translated */ ));
 		}
 
 		/*
@@ -3385,9 +3385,9 @@ retry:
 		 * Emit this error right now then retry this page immediately. Use
 		 * errmsg_internal() because the message was already translated.
 		 */
-		if (xlogreader->errormsg_buf[0])
+		if (xlogreader->errordata.message[0])
 			ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-					(errmsg_internal("%s", xlogreader->errormsg_buf)));
+					(errmsg_internal("%s", xlogreader->errordata.message)));
 
 		/* reset any error XLogReaderValidatePageHeader() might have set */
 		XLogReaderResetError(xlogreader);
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index e174a2a891..5c64454e7e 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -395,7 +395,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 		if (!RestoreBlockImage(record, block_id, page))
 			ereport(ERROR,
 					(errcode(ERRCODE_INTERNAL_ERROR),
-					 errmsg_internal("%s", record->errormsg_buf)));
+					 errmsg_internal("%s", record->errordata.message)));
 
 		/*
 		 * The page may be uninitialized. If so, we can't set the LSN because
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 41243d0187..f48feab944 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -641,12 +641,13 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 	for (;;)
 	{
 		XLogRecord *record;
-		char	   *err = NULL;
+		XLogReaderError errordata = {0};
 
 		/* the read_page callback waits for new WAL */
-		record = XLogReadRecord(ctx->reader, &err);
-		if (err)
-			elog(ERROR, "could not find logical decoding starting point: %s", err);
+		record = XLogReadRecord(ctx->reader, &errordata);
+		if (errordata.message)
+			elog(ERROR, "could not find logical decoding starting point: %s",
+				 errordata.message);
 		if (!record)
 			elog(ERROR, "could not find logical decoding starting point");
 
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 55a24c02c9..e7f74809e3 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -244,11 +244,12 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		while (ctx->reader->EndRecPtr < end_of_wal)
 		{
 			XLogRecord *record;
-			char	   *errm = NULL;
+			XLogReaderError errordata = {0};
 
-			record = XLogReadRecord(ctx->reader, &errm);
-			if (errm)
-				elog(ERROR, "could not find record for logical decoding: %s", errm);
+			record = XLogReadRecord(ctx->reader, &errordata);
+			if (errordata.message)
+				elog(ERROR, "could not find record for logical decoding: %s",
+					 errordata.message);
 
 			/*
 			 * The {begin_txn,change,commit_txn}_wrapper callbacks above will
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6035cf4816..4fa4e6bfed 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -503,17 +503,17 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 		/* Decode at least one record, until we run out of records */
 		while (ctx->reader->EndRecPtr < moveto)
 		{
-			char	   *errm = NULL;
 			XLogRecord *record;
+			XLogReaderError errordata = {0};
 
 			/*
 			 * Read records.  No changes are generated in fast_forward mode,
 			 * but snapbuilder/slot statuses are updated properly.
 			 */
-			record = XLogReadRecord(ctx->reader, &errm);
-			if (errm)
+			record = XLogReadRecord(ctx->reader, &errordata);
+			if (errordata.message)
 				elog(ERROR, "could not find record while advancing replication slot: %s",
-					 errm);
+					 errordata.message);
 
 			/*
 			 * Process the record.  Storage-level changes are ignored in
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d27ef2985d..d05c60f09f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3045,7 +3045,7 @@ static void
 XLogSendLogical(void)
 {
 	XLogRecord *record;
-	char	   *errm;
+	XLogReaderError errordata = {0};
 
 	/*
 	 * We'll use the current flush point to determine whether we've caught up.
@@ -3063,12 +3063,12 @@ XLogSendLogical(void)
 	 */
 	WalSndCaughtUp = false;
 
-	record = XLogReadRecord(logical_decoding_ctx->reader, &errm);
+	record = XLogReadRecord(logical_decoding_ctx->reader, &errordata);
 
 	/* xlog record was invalid */
-	if (errm != NULL)
+	if (errordata.message != NULL)
 		elog(ERROR, "could not find record while sending logically-decoded data: %s",
-			 errm);
+			 errordata.message);
 
 	if (record != NULL)
 	{
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 27782237d0..2705d9bf45 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -68,7 +68,7 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 	XLogPageReadPrivate private;
 
 	private.tliIndex = tliIndex;
@@ -82,16 +82,16 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
 	XLogBeginRead(xlogreader, startpoint);
 	do
 	{
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogReadRecord(xlogreader, &errordata);
 
 		if (record == NULL)
 		{
 			XLogRecPtr	errptr = xlogreader->EndRecPtr;
 
-			if (errormsg)
+			if (errordata.message)
 				pg_fatal("could not read WAL record at %X/%X: %s",
 						 LSN_FORMAT_ARGS(errptr),
-						 errormsg);
+						 errordata.message);
 			else
 				pg_fatal("could not read WAL record at %X/%X",
 						 LSN_FORMAT_ARGS(errptr));
@@ -126,7 +126,7 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex,
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 	XLogPageReadPrivate private;
 	XLogRecPtr	endptr;
 
@@ -139,12 +139,12 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex,
 		pg_fatal("out of memory while allocating a WAL reading processor");
 
 	XLogBeginRead(xlogreader, ptr);
-	record = XLogReadRecord(xlogreader, &errormsg);
+	record = XLogReadRecord(xlogreader, &errordata);
 	if (record == NULL)
 	{
-		if (errormsg)
+		if (errordata.message)
 			pg_fatal("could not read WAL record at %X/%X: %s",
-					 LSN_FORMAT_ARGS(ptr), errormsg);
+					 LSN_FORMAT_ARGS(ptr), errordata.message);
 		else
 			pg_fatal("could not read WAL record at %X/%X",
 					 LSN_FORMAT_ARGS(ptr));
@@ -173,7 +173,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 	XLogRecord *record;
 	XLogRecPtr	searchptr;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 	XLogPageReadPrivate private;
 
 	/*
@@ -204,14 +204,14 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 		uint8		info;
 
 		XLogBeginRead(xlogreader, searchptr);
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogReadRecord(xlogreader, &errordata);
 
 		if (record == NULL)
 		{
-			if (errormsg)
+			if (errordata.message)
 				pg_fatal("could not find previous WAL record at %X/%X: %s",
 						 LSN_FORMAT_ARGS(searchptr),
-						 errormsg);
+						 errordata.message);
 			else
 				pg_fatal("could not find previous WAL record at %X/%X",
 						 LSN_FORMAT_ARGS(searchptr));
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index e8b5a6cd61..4129ba901b 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -508,7 +508,7 @@ XLogRecordSaveFPWs(XLogReaderState *record, const char *savepath)
 
 		/* Full page exists, so let's save it */
 		if (!RestoreBlockImage(record, block_id, page))
-			pg_fatal("%s", record->errormsg_buf);
+			pg_fatal("%s", record->errordata.message);
 
 		(void) XLogRecGetBlockTagExtended(record, block_id,
 										  &rnode, &fork, &blk, NULL);
@@ -796,7 +796,7 @@ main(int argc, char **argv)
 	XLogRecord *record;
 	XLogRecPtr	first_record;
 	char	   *waldir = NULL;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
 	static struct option long_options[] = {
 		{"bkp-details", no_argument, NULL, 'b'},
@@ -1239,7 +1239,7 @@ main(int argc, char **argv)
 		}
 
 		/* try to read the next record */
-		record = XLogReadRecord(xlogreader_state, &errormsg);
+		record = XLogReadRecord(xlogreader_state, &errordata);
 		if (!record)
 		{
 			if (!config.follow || private.endptr_reached)
@@ -1304,10 +1304,10 @@ main(int argc, char **argv)
 	if (time_to_stop)
 		exit(0);
 
-	if (errormsg)
+	if (errordata.message)
 		pg_fatal("error in WAL record at %X/%X: %s",
 				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+				 errordata.message);
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index 796a74f322..e7d30554ed 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -146,9 +146,9 @@ static XLogRecord *
 ReadNextXLogRecord(XLogReaderState *xlogreader)
 {
 	XLogRecord *record;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
-	record = XLogReadRecord(xlogreader, &errormsg);
+	record = XLogReadRecord(xlogreader, &errordata);
 
 	if (record == NULL)
 	{
@@ -161,11 +161,12 @@ ReadNextXLogRecord(XLogReaderState *xlogreader)
 		if (private_data->end_of_wal)
 			return NULL;
 
-		if (errormsg)
+		if (errordata.message)
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read WAL at %X/%X: %s",
-							LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+							LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+							errordata.message)));
 		else
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -384,7 +385,7 @@ GetWALBlockInfo(FunctionCallInfo fcinfo, XLogReaderState *record,
 			if (!RestoreBlockImage(record, block_id, page))
 				ereport(ERROR,
 						(errcode(ERRCODE_INTERNAL_ERROR),
-						 errmsg_internal("%s", record->errormsg_buf)));
+						 errmsg_internal("%s", record->errordata.message)));
 
 			block_fpi_data = (bytea *) palloc(BLCKSZ + VARHDRSZ);
 			SET_VARSIZE(block_fpi_data, BLCKSZ + VARHDRSZ);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 66823bc2a7..53ce72c4c2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3077,6 +3077,8 @@ XLogPageReadResult
 XLogPrefetchStats
 XLogPrefetcher
 XLogPrefetcherFilter
+XLogReaderError
+XLogReaderErrorCode
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.40.1

v3-0002-Make-WAL-replay-more-robust-on-OOM-failures.patch (text/x-diff; charset=us-ascii)
From 56a11dc5cc8986c1a2fef9cb14508dd4dae82566 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 9 Aug 2023 17:41:41 +0900
Subject: [PATCH v3 2/3] Make WAL replay more robust on OOM failures

This takes advantage of the new error facility for WAL readers, allowing
WAL replay to loop when an out-of-memory failure happens while reading a
record.  Ending recovery on such a failure was the origin of potential
data loss scenarios; WAL replay is now more robust, acting like a
standby here.
---
 src/backend/access/transam/xlogrecovery.c | 75 ++++++++++++++++-------
 1 file changed, 52 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 68100bfa4a..a1149439e9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3067,29 +3067,50 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errordata);
 		if (record == NULL)
 		{
-			/*
-			 * When we find that WAL ends in an incomplete record, keep track
-			 * of that record.  After recovery is done, we'll write a record
-			 * to indicate to downstream WAL readers that that portion is to
-			 * be ignored.
-			 *
-			 * However, when ArchiveRecoveryRequested = true, we're going to
-			 * switch to a new timeline at the end of recovery. We will only
-			 * copy WAL over to the new timeline up to the end of the last
-			 * complete record, so if we did this, we would later create an
-			 * overwrite contrecord in the wrong place, breaking everything.
-			 */
-			if (!ArchiveRecoveryRequested &&
-				!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
+			switch (errordata.code)
 			{
-				abortedRecPtr = xlogreader->abortedRecPtr;
-				missingContrecPtr = xlogreader->missingContrecPtr;
-			}
+				case XLOG_READER_NO_ERROR:
+					/* Possible when XLogPageRead() has failed */
+					Assert(!errordata.message);
+					/* FALLTHROUGH */
 
-			if (readFile >= 0)
-			{
-				close(readFile);
-				readFile = -1;
+				case XLOG_READER_INVALID_DATA:
+
+					/*
+					 * When we find that WAL ends in an incomplete record,
+					 * keep track of that record.  After recovery is done,
+					 * we'll write a record to indicate to downstream WAL
+					 * readers that that portion is to be ignored.
+					 *
+					 * However, when ArchiveRecoveryRequested = true, we're
+					 * going to switch to a new timeline at the end of
+					 * recovery. We will only copy WAL over to the new
+					 * timeline up to the end of the last complete record, so
+					 * if we did this, we would later create an overwrite
+					 * contrecord in the wrong place, breaking everything.
+					 */
+					if (!ArchiveRecoveryRequested &&
+						!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
+					{
+						abortedRecPtr = xlogreader->abortedRecPtr;
+						missingContrecPtr = xlogreader->missingContrecPtr;
+					}
+
+					if (readFile >= 0)
+					{
+						close(readFile);
+						readFile = -1;
+					}
+					break;
+				case XLOG_READER_OOM:
+
+					/*
+					 * If we failed because of an out-of-memory problem, just
+					 * give up and retry recovery later.  It may be possible
+					 * that the WAL record to decode required a larger memory
+					 * allocation than what the host can offer.
+					 */
+					break;
 			}
 
 			/*
@@ -3147,9 +3168,12 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * WAL from the archive, even if pg_wal is completely empty, but
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
+			 *
+			 * It may be possible that the record was not decoded because of
+			 * an out-of-memory failure.  In this case, just loop.
 			 */
 			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+				!fetching_ckpt && errordata.code != XLOG_READER_OOM)
 			{
 				ereport(DEBUG1,
 						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
@@ -3173,9 +3197,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
+			/*
+			 * In standby mode or if the WAL record failed on an out-of-memory,
+			 * loop back to retry.  Otherwise, give up.
+			 */
 			if (StandbyMode && !CheckForStandbyTrigger())
 				continue;
+			else if (errordata.code == XLOG_READER_OOM)
+				continue;
 			else
 				return NULL;
 		}
-- 
2.40.1
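
The last hunk of 0002 boils down to a small decision made after a failed
read. As a rough standalone sketch of that decision (not PostgreSQL code):
the enum values are copied from 0001, while ReplayAction,
next_replay_action() and its two boolean parameters are hypothetical names
standing in for StandbyMode and the result of CheckForStandbyTrigger().

```c
#include <stdbool.h>

/* Error codes as introduced by 0001 (copied from the patch). */
typedef enum XLogReaderErrorCode
{
	XLOG_READER_NO_ERROR = 0,
	XLOG_READER_OOM,			/* out-of-memory */
	XLOG_READER_INVALID_DATA,	/* record data */
} XLogReaderErrorCode;

/* Hypothetical result type, for illustration only. */
typedef enum ReplayAction
{
	REPLAY_RETRY,				/* loop back and read the record again */
	REPLAY_END_OF_WAL,			/* give up: treated as end of WAL */
} ReplayAction;

/*
 * Models the tail of the failure path in ReadRecord() as patched by 0002:
 * a standby that has not been asked to promote retries, an OOM retries in
 * any mode, and everything else ends recovery.
 */
static ReplayAction
next_replay_action(XLogReaderErrorCode code, bool standby_mode,
				   bool promote_triggered)
{
	if (standby_mode && !promote_triggered)
		return REPLAY_RETRY;
	else if (code == XLOG_READER_OOM)
		return REPLAY_RETRY;
	else
		return REPLAY_END_OF_WAL;
}
```

Read this way, an OOM keeps looping even in crash recovery, and also
after a promotion request on a standby, since the OOM branch is checked
after the standby branch.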

#17Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#16)
Re: Incorrect handling of OOM in WAL replay leading to data loss

At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier <michael@paquier.xyz> wrote in

While it's a kind of bug in total, we encountered a case where an
excessively large xl_tot_len actually came from a corrupted
record. [1]

Right, I remember this one. I think that Thomas was pretty much right
that this could be caused because of a lack of zeroing in the WAL
pages.

We have treated every kind of broken data as end-of-recovery, like
incorrect rm_id or prev link including excessively large record length
due to corruption. This patch is going to change the behavior only for
the last one. If you think there can't be non-zero broken data, we
should inhibit proceeding recovery after all non-zero incorrect
data. This seems to be a quite big change in our recovery policy.

There are a few options on the table, only doable once the WAL reader
provides the error state to the startup process:
1) Retry a few times and FATAL.
2) Just FATAL immediately and don't wait.
3) Retry and hope for the best that the host calms down.

4) Wrap up recovery then continue to normal operation.

This is the traditional behavior for corrupt WAL data.

I have not seen this being much of an issue in the field, so
perhaps option 2 with the structure of 0002 and a FATAL when we catch
XLOG_READER_OOM in the switch would be enough. At least that's enough
for the cases we've seen. I'll think a bit more about it, as well.

Yeah, agreed. That's orthogonal to the issue reported by Ethan,
unfortunately, where he was able to trigger the issue of this thread
by manipulating the sizing of a host after producing a record larger
than what the host could afford after the resizing :/

I'm not entirely certain, but if you were to ask me which is more
probable during recovery - encountering a correct record that's too
lengthy for the server to buffer or stumbling upon a corrupt byte
sequence - I'd bet on the latter.

I'm not sure how often users encounter corrupt WAL data, but I believe
they should have the option to terminate recovery and then switch to
normal operation.

What if we introduced an option to increase the timeline whenever
recovery hits a data error? If that option is disabled, the server stops
when recovery detects incorrect data, except in the case of an
OOM. An OOM causes the record to be retried.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#18Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#17)
Re: Incorrect handling of OOM in WAL replay leading to data loss

At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier <michael@paquier.xyz> wrote in

While it's a kind of bug in total, we encountered a case where an
excessively large xl_tot_len actually came from a corrupted
record. [1]

Right, I remember this one. I think that Thomas was pretty much right
that this could be caused because of a lack of zeroing in the WAL
pages.

We have treated every kind of broken data as end-of-recovery, like
incorrect rm_id or prev link including excessively large record length
due to corruption. This patch is going to change the behavior only for
the last one. If you think there can't be non-zero broken data, we
should inhibit proceeding recovery after all non-zero incorrect
data. This seems to be a quite big change in our recovery policy.

There are a few options on the table, only doable once the WAL reader
provides the error state to the startup process:
1) Retry a few times and FATAL.
2) Just FATAL immediately and don't wait.
3) Retry and hope for the best that the host calms down.

4) Wrap up recovery then continue to normal operation.

This is the traditional behavior for corrupt WAL data.

I have not seen this being much of an issue in the field, so
perhaps option 2 with the structure of 0002 and a FATAL when we catch
XLOG_READER_OOM in the switch would be enough. At least that's enough
for the cases we've seen. I'll think a bit more about it, as well.

Yeah, agreed. That's orthogonal to the issue reported by Ethan,
unfortunately, where he was able to trigger the issue of this thread
by manipulating the sizing of a host after producing a record larger
than what the host could afford after the resizing :/

I'm not entirely certain, but if you were to ask me which is more
probable during recovery - encountering a correct record that's too
lengthy for the server to buffer or stumbling upon a corrupt byte
sequence - I'd bet on the latter.

... of course this refers to crash recovery. For replication, we
should keep retrying the current record until the operator commands
promotion.

I'm not sure how often users encounter corrupt WAL data, but I believe
they should have the option to terminate recovery and then switch to
normal operation.

What if we introduced an option to increase the timeline whenever
recovery hits a data error? If that option is disabled, the server stops
when recovery detects incorrect data, except in the case of an
OOM. An OOM causes the record to be retried.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#19Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#18)
Re: Incorrect handling of OOM in WAL replay leading to data loss

On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:

At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier <michael@paquier.xyz> wrote in

While it's a kind of bug in total, we encountered a case where an
excessively large xl_tot_len actually came from a corrupted
record. [1]

Right, I remember this one. I think that Thomas was pretty much right
that this could be caused because of a lack of zeroing in the WAL
pages.

We have treated every kind of broken data as end-of-recovery, like
incorrect rm_id or prev link including excessively large record length
due to corruption. This patch is going to change the behavior only for
the last one. If you think there can't be non-zero broken data, we
should inhibit proceeding recovery after all non-zero incorrect
data. This seems to be a quite big change in our recovery policy.

Well, per the report that led to this thread, we may lose data and
finish with corrupted pages. I was planning to reply to the other
thread [1] and the patch of Noah anyway, because we have to fix the
detection of OOM vs corrupted records in the allocation path anyway.
The infra introduced by 0001 and something like 0002 that allows the
startup process to take a different path depending on the type of
error are still needed to avoid a too early end-of-recovery, though.

There are a few options on the table, only doable once the WAL reader
provides the error state to the startup process:
1) Retry a few times and FATAL.
2) Just FATAL immediately and don't wait.
3) Retry and hope for the best that the host calms down.

4) Wrap up recovery then continue to normal operation.

This is the traditional behavior for corrupt WAL data.

Yeah, we've been doing a pretty bad job in classifying the errors that
can happen doing WAL replay in crash recovery, because we assume that
all the WAL still in pg_wal/ is correct. That's not an easy problem,
because the record CRC works for all the contents of the record, and
we look at the record header before that. Another idea may be to have
an extra CRC only for the header itself, that can be used in isolation
as one of the checks in XLogReaderValidatePageHeader().
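
To make the header-only CRC idea a bit more concrete, here is a minimal
sketch under loud assumptions: DemoRecordHeader is a made-up layout, not
the real XLogRecord, and crc32c() is a plain bitwise CRC-32C written for
this example rather than PostgreSQL's own CRC code. The point is only
that a checksum covering the fixed-size header lets the length field be
validated before any allocation is sized from it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t
crc32c(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t	crc = 0xFFFFFFFF;

	while (len-- > 0)
	{
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78 & (0U - (crc & 1)));
	}
	return ~crc;
}

/* Hypothetical fixed-size header; NOT the real XLogRecord layout. */
typedef struct DemoRecordHeader
{
	uint32_t	tot_len;		/* total record length */
	uint32_t	xid;
	uint64_t	prev;			/* pointer to previous record */
	uint32_t	hdr_crc;		/* CRC of all the fields above */
} DemoRecordHeader;

/* Compute and store the header CRC (writer side). */
static void
demo_header_seal(DemoRecordHeader *hdr)
{
	hdr->hdr_crc = crc32c(hdr, offsetof(DemoRecordHeader, hdr_crc));
}

/* Check the header in isolation, before trusting tot_len (reader side). */
static bool
demo_header_is_valid(const DemoRecordHeader *hdr)
{
	return hdr->hdr_crc == crc32c(hdr, offsetof(DemoRecordHeader, hdr_crc));
}
```

With such a check, a reader could reject a garbage tot_len as invalid
data first, so that an allocation failure on a header that passed the
check is much more likely to be a genuine OOM.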

I have not seen this being much of an issue in the field, so
perhaps option 2 with the structure of 0002 and a FATAL when we catch
XLOG_READER_OOM in the switch would be enough. At least that's enough
for the cases we've seen. I'll think a bit more about it, as well.

Yeah, agreed. That's orthogonal to the issue reported by Ethan,
unfortunately, where he was able to trigger the issue of this thread
by manipulating the sizing of a host after producing a record larger
than what the host could afford after the resizing :/

I'm not entirely certain, but if you were to ask me which is more
probable during recovery - encountering a correct record that's too
lengthy for the server to buffer or stumbling upon a corrupt byte
sequence - I'd bet on the latter.

I don't really believe in chance when it comes to computer science,
facts and a correct detection of such facts are better :)

... of course this refers to crash recovery. For replication, we
should keep retrying the current record until the operator commands
promotion.

Are you referring to a retry if there is a standby.signal? I am a
bit confused by this sentence, because we could do a crash recovery,
then switch to archive recovery. So, I guess that you mean that on
OOM we should retry to retrieve WAL from the local pg_wal/ even in the
case where we are in the crash recovery phase, *before* switching to
archive recovery and a different source, right? I think that this
argument can go two ways, because it could be more helpful for some to
see a FATAL when we are still in crash recovery, even if there is a
standby.signal. It does not seem to me that we have a clear
definition about what to do in which case, either. Now we just fail
and hope for the best when doing crash recovery.

I'm not sure how often users encounter corrupt WAL data, but I believe
they should have the option to terminate recovery and then switch to
normal operation.

What if we introduced an option to increase the timeline whenever
recovery hits a data error? If that option is disabled, the server stops
when recovery detects incorrect data, except in the case of an
OOM. An OOM causes the record to be retried.

I guess that it depends on how much responsiveness one may want.
Forcing a failure on OOM is at least something that users would be
immediately able to act on when we don't run a standby but just
recover from a crash, while a standby would do what it is designed to
do, aka continue to replay what it can see. One issue with the
wait-and-continue is that a standby may loop continuously on OOM,
which could be also bad if there's a replication slot retaining WAL on
the primary. Perhaps that's just OK to keep doing that for a
standby. At least this makes the discussion easier for the sake of
this thread: just consider the case of crash recovery when we don't
have a standby.
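
For the crash recovery case, options 1 and 2 from the list above differ
only in how many OOM failures are tolerated before failing hard. A
minimal sketch of that policy, assuming a hypothetical read callback:
RecoveryOutcome, read_attempt_fn, read_with_bounded_retry() and
oom_then_succeed() are all invented for illustration, and only the error
codes come from 0001.

```c
#include <stdbool.h>

/* Error codes as introduced by 0001 (copied from the patch). */
typedef enum XLogReaderErrorCode
{
	XLOG_READER_NO_ERROR = 0,
	XLOG_READER_OOM,
	XLOG_READER_INVALID_DATA,
} XLogReaderErrorCode;

/* Hypothetical outcomes of trying to read one record. */
typedef enum RecoveryOutcome
{
	RECOVERY_RECORD_READ,		/* got a record, keep replaying */
	RECOVERY_END_OF_WAL,		/* invalid data or no more WAL */
	RECOVERY_FATAL_OOM,			/* too many OOM failures: stop hard */
} RecoveryOutcome;

/* Hypothetical stand-in for a single read attempt. */
typedef XLogReaderErrorCode (*read_attempt_fn) (void *arg, bool *got_record);

/*
 * Retry on OOM up to max_oom_retries times, then fail.  With
 * max_oom_retries = 0 this is "FATAL immediately" (option 2).
 */
static RecoveryOutcome
read_with_bounded_retry(read_attempt_fn attempt, void *arg,
						int max_oom_retries)
{
	int			oom_failures = 0;

	for (;;)
	{
		bool		got_record = false;
		XLogReaderErrorCode code = attempt(arg, &got_record);

		if (got_record)
			return RECOVERY_RECORD_READ;
		if (code != XLOG_READER_OOM)
			return RECOVERY_END_OF_WAL;
		if (++oom_failures > max_oom_retries)
			return RECOVERY_FATAL_OOM;
		/* a real implementation would presumably wait before retrying */
	}
}

/* Demo attempt: reports OOM until *arg reaches zero, then succeeds. */
static XLogReaderErrorCode
oom_then_succeed(void *arg, bool *got_record)
{
	int		   *remaining = arg;

	if (*remaining > 0)
	{
		(*remaining)--;
		*got_record = false;
		return XLOG_READER_OOM;
	}
	*got_record = true;
	return XLOG_READER_NO_ERROR;
}
```

The unbounded variant (option 3) is what a standby effectively does
already; the bound is what turns silent looping into something an
operator can act on.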
--
Michael

#20Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#19)
Re: Incorrect handling of OOM in WAL replay leading to data loss

At Thu, 10 Aug 2023 13:33:48 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:

At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

We have treated every kind of broken data as end-of-recovery, like
incorrect rm_id or prev link including excessively large record length
due to corruption. This patch is going to change the behavior only for
the last one. If you think there can't be non-zero broken data, we
should inhibit proceeding recovery after all non-zero incorrect
data. This seems to be a quite big change in our recovery policy.

Well, per the report that led to this thread, we may lose data and
finish with corrupted pages. I was planning to reply to the other
thread [1] and the patch of Noah anyway, because we have to fix the
detection of OOM vs corrupted records in the allocation path anyway.

Does this mean we will address the distinction between an OOM and a
corrupt total record length later on? If that's the case, should we
modify that behavior right now?

The infra introduced by 0001 and something like 0002 that allows the
startup process to take a different path depending on the type of
error are still needed to avoid a too early end-of-recovery, though.

Agreed.

4) Wrap up recovery then continue to normal operation.

This is the traditional behavior for corrupt WAL data.

Yeah, we've been doing a pretty bad job in classifying the errors that
can happen doing WAL replay in crash recovery, because we assume that
all the WAL still in pg_wal/ is correct. That's not an easy problem,

I'm not quite sure what "correct" means here. I believe xlogreader
runs various checks since the data may be incorrect. Given that can
break for various reasons, during crash recovery, we continue as long
as incoming WAL records remain consistently valid. The problem raised
here is that we can't distinctly identify an OOM from a corrupted total
record length field. One reason is we check the data after a part is
loaded, but we can't load all the bytes from the record into memory in
such cases.

because the record CRC works for all the contents of the record, and
we look at the record header before that. Another idea may be to have
an extra CRC only for the header itself, that can be used in isolation
as one of the checks in XLogReaderValidatePageHeader().

Sounds reasonable. By using CRC to protect the header part and
allocating a fixed-length buffer for it, I believe we're adopting a
standard approach and can identify OOM and other kinds of header errors
with a good degree of certainty.

I'm not entirely certain, but if you were to ask me which is more
probable during recovery - encountering a correct record that's too
lengthy for the server to buffer or stumbling upon a corrupt byte
sequence - I'd bet on the latter.

I don't really believe in chance when it comes to computer science,
facts and a correct detection of such facts are better :)

Even now, it seems like we're balancing the risk of potential data
loss against the potential inability to start the server. What I meant
is that, if I had to choose, I'd lean slightly towards avoiding the
latter, in other words, keeping the current behavior.

... of course this refers to crash recovery. For replication, we
should keep retrying the current record until the operator commands
promotion.

Are you referring to a retry if there is a standby.signal? I am a
bit confused by this sentence, because we could do a crash recovery,
then switch to archive recovery. So, I guess that you mean that on
OOM we should retry to retrieve WAL from the local pg_wal/ even in the
case where we are in the crash recovery phase, *before* switching to

Apologies for the confusion. What I was thinking is that an OOM is
more likely to occur in replication downstreams than during server
startup. I also felt for the latter case that such a challenging
environment probably wouldn't let the server enter stable normal
operation.

archive recovery and a different source, right? I think that this
argument can go two ways, because it could be more helpful for some to
see a FATAL when we are still in crash recovery, even if there is a
standby.signal. It does not seem to me that we have a clear
definition about what to do in which case, either. Now we just fail
and hope for the best when doing crash recovery.

Agreed.

I'm not sure how often users encounter corrupt WAL data, but I believe
they should have the option to terminate recovery and then switch to
normal operation.

What if we introduced an option to increase the timeline whenever
recovery hits a data error? If that option is disabled, the server stops
when recovery detects incorrect data, except in the case of an
OOM. An OOM causes the record to be retried.

I guess that it depends on how much responsiveness one may want.
Forcing a failure on OOM is at least something that users would be
immediately able to act on when we don't run a standby but just
recover from a crash, while a standby would do what it is designed to
do, aka continue to replay what it can see. One issue with the
wait-and-continue is that a standby may loop continuously on OOM,
which could be also bad if there's a replication slot retaining WAL on
the primary. Perhaps that's just OK to keep doing that for a
standby. At least this makes the discussion easier for the sake of
this thread: just consider the case of crash recovery when we don't
have a standby.

Yeah, I'm with you on focusing on crash recovery cases; that's what I
intended. Sorry for any confusion.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#21Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#20)
Re: Incorrect handling of OOM in WAL replay leading to data loss

On Thu, Aug 10, 2023 at 02:47:51PM +0900, Kyotaro Horiguchi wrote:

At Thu, 10 Aug 2023 13:33:48 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:

At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

We have treated every kind of broken data as end-of-recovery, like
incorrect rm_id or prev link including excessively large record length
due to corruption. This patch is going to change the behavior only for
the last one. If you think there can't be non-zero broken data, we
should inhibit proceeding recovery after all non-zero incorrect
data. This seems to be a quite big change in our recovery policy.

Well, per the report that led to this thread, we may lose data and
finish with corrupted pages. I was planning to reply to the other
thread [1] and the patch of Noah anyway, because we have to fix the
detection of OOM vs corrupted records in the allocation path anyway.

Does this mean we will address the distinction between an OOM and a
corrupt total record length later on? If that's the case, should we
modify that behavior right now?

My apologies if I sounded unclear here. It seems to me that we should
wrap up the patch on [1] first, and get it backpatched. At least that
makes for fewer conflicts when 0001 gets merged for HEAD when we are
able to set a proper error code. (Was looking at it, actually.)

4) Wrap up recovery then continue to normal operation.

This is the traditional behavior for corrupt WAL data.

Yeah, we've been doing a pretty bad job in classifying the errors that
can happen doing WAL replay in crash recovery, because we assume that
all the WAL still in pg_wal/ is correct. That's not an easy problem,

I'm not quite sure what "correct" means here. I believe xlogreader
runs various checks since the data may be incorrect. Given that can
break for various reasons, during crash recovery, we continue as long
as incoming WAL records remain consistently valid. The problem raised
here is that we can't distinctly identify an OOM from a corrupted total
record length field. One reason is we check the data after a part is
loaded, but we can't load all the bytes from the record into memory in
such cases.

Yep. Correct means that we end recovery in a consistent state, not
earlier than we should.

because the record CRC works for all the contents of the record, and
we look at the record header before that. Another idea may be to have
an extra CRC only for the header itself, that can be used in isolation
as one of the checks in XLogReaderValidatePageHeader().

Sounds reasonable. By using CRC to protect the header part and
allocating a fixed-length buffer for it, I believe we're adopting a
standard approach and can identify OOM and other kinds of header errors
with a good degree of certainty.

Not something that I'd like to cover in this patch set, though.. This
is a problem on its own.

I'm not entirely certain, but if you were to ask me which is more
probable during recovery - encountering a correct record that's too
lengthy for the server to buffer or stumbling upon a corrupt byte
sequence - I'd bet on the latter.

I don't really believe in chance when it comes to computer science,
facts and a correct detection of such facts are better :)

Even now, it seems like we're balancing the risk of potential data
loss against the potential inability to start the server. What I meant
is that, if I had to choose, I'd lean slightly towards avoiding the
latter, in other words, keeping the current behavior.

On OOM, this means data loss and silent corruption. A failure has the
merit to tell someone that something is wrong, at least, and that
they'd better look at it rather than hope for the best.

... of course this refers to crash recovery. For replication, we
should keep retrying the current record until the operator commands
promotion.

Are you referring to a retry if there is a standby.signal? I am a
bit confused by this sentence, because we could do a crash recovery,
then switch to archive recovery. So, I guess that you mean that on
OOM we should retry to retrieve WAL from the local pg_wal/ even in the
case where we are in the crash recovery phase, *before* switching to

Apologies for the confusion. What I was thinking is that an OOM is
more likely to occur in replication downstreams than during server
startup. I also felt for the latter case that such a challenging
environment probably wouldn't let the server enter stable normal
operation.

It depends on what the user does with the host running the cluster.
Both could be impacted.

I'm not sure how often users encounter corrupt WAL data, but I believe
they should have the option to terminate recovery and then switch to
normal operation.

What if we introduced an option to increase the timeline whenever
recovery hits a data error? If that option is disabled, the server stops
when recovery detects incorrect data, except in the case of an
OOM. An OOM causes the record to be retried.

I guess that it depends on how much responsiveness one may want.
Forcing a failure on OOM is at least something that users would be
immediately able to act on when we don't run a standby but just
recover from a crash, while a standby would do what it is designed to
do, aka continue to replay what it can see. One issue with the
wait-and-continue is that a standby may loop continuously on OOM,
which could be also bad if there's a replication slot retaining WAL on
the primary. Perhaps that's just OK to keep doing that for a
standby. At least this makes the discussion easier for the sake of
this thread: just consider the case of crash recovery when we don't
have a standby.

Yeah, I'm with you on focusing on crash recovery cases; that's what I
intended. Sorry for any confusion.

Okay, so we're on the same page here, keeping standbys as they are and
doing something for the crash recovery case.
--
Michael

#22Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#21)
4 attachment(s)
Re: Incorrect handling of OOM in WAL replay leading to data loss

On Thu, Aug 10, 2023 at 02:59:07PM +0900, Michael Paquier wrote:

My apologies if I sounded unclear here. It seems to me that we should
wrap up the patch on [1] first, and get it backpatched. At least that
makes for fewer conflicts when 0001 gets merged for HEAD when we are
able to set a proper error code. (Was looking at it, actually.)

Now that Thomas Munro has addressed the original problem with
bae868caf22, so that xl_tot_len can be trusted, I am coming back to
this thread.

First, attached is a rebased set:
- 0001 to introduce the new error infra for xlogreader.c with an error
code, so that callers can tell the difference between an OOM and an
invalid record.
- 0002 to tweak the startup process. Once again, I've taken the
approach of making the startup process behave like a standby during
crash recovery: each time an OOM is found, we loop and retry.
- 0003 to emulate an OOM failure, which can be used with the script
attached to see that we don't stop recovery too early.
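
The 0003 patch itself is not reproduced in this part of the thread, so
the following is not its content, only a generic sketch of how an
allocation failure can be emulated for testing. The names
inject_oom_at_call, alloc_calls and alloc_record_buffer() are
hypothetical; the idea is simply to make one chosen allocation return
NULL so that the OOM path of the reader gets exercised.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical knob: fail the Nth allocation (1-based); 0 disables it. */
static int	inject_oom_at_call = 0;

/* Number of allocation attempts made so far. */
static int	alloc_calls = 0;

/*
 * Allocation wrapper with fault injection.  Returns NULL either when
 * malloc() really fails or when the injected failure point is reached.
 */
static void *
alloc_record_buffer(size_t size)
{
	alloc_calls++;
	if (inject_oom_at_call != 0 && alloc_calls == inject_oom_at_call)
		return NULL;			/* simulated out-of-memory */
	return malloc(size);
}
```

Failing only one specific call, rather than every call, makes it
possible to check that replay continues past the failed record once the
retry succeeds.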

I guess that it depends on how much responsiveness one may want.
Forcing a failure on OOM is at least something that users would be
immediately able to act on when we don't run a standby but just
recover from a crash, while a standby would do what it is designed to
do, aka continue to replay what it can see. One issue with the
wait-and-continue is that a standby may loop continuously on OOM,
which could be also bad if there's a replication slot retaining WAL on
the primary. Perhaps that's just OK to keep doing that for a
standby. At least this makes the discussion easier for the sake of
this thread: just consider the case of crash recovery when we don't
have a standby.

Yeah, I'm with you on focusing on crash recovery cases; that's what I
intended. Sorry for any confusion.

Okay, so we're on the same page here, keeping standbys as they are and
doing something for the crash recovery case.

For the crash recovery case, one argument that stood out in my mind is
that causing a hard failure has the disadvantage of forcing users to
redo WAL replay from the last redo position, which may be far away
even if the checkpointer now runs during crash recovery. What I am
proposing on this thread has the merit of avoiding that. Anyway, let's
discuss more before settling this point for the crash recovery case.

By the way, anything that I am proposing here cannot be backpatched
because of the infrastructure changes required in xlogreader.c, so I am
going to create a second thread with something that could be
backpatched (yeah, likely FATALs on OOM to stop recovery from doing
something bad).
--
Michael

Attachments:

v4-0001-Add-infrastructure-to-report-error-codes-in-WAL-r.patch (text/x-diff; charset=us-ascii)
From 3d7bb24fc2f9f070273b63208819c4e54e428d18 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 26 Sep 2023 15:40:05 +0900
Subject: [PATCH v4 1/3] Add infrastructure to report error codes in WAL reader

This commit moves the error state coming from WAL readers into a new
structure that includes the existing pointer to the error message
buffer, and also gains an error code that is fed back to the callers of
the following routines:
XLogPrefetcherReadRecord()
XLogReadRecord()
XLogNextRecord()
DecodeXLogRecord()

This will help in improving the decisions to take during recovery
depending on the failure mode reported.
---
 src/include/access/xlogprefetcher.h           |   2 +-
 src/include/access/xlogreader.h               |  33 +++-
 src/backend/access/transam/twophase.c         |   8 +-
 src/backend/access/transam/xlog.c             |   6 +-
 src/backend/access/transam/xlogprefetcher.c   |   4 +-
 src/backend/access/transam/xlogreader.c       | 170 ++++++++++++------
 src/backend/access/transam/xlogrecovery.c     |  14 +-
 src/backend/access/transam/xlogutils.c        |   2 +-
 src/backend/replication/logical/logical.c     |   9 +-
 .../replication/logical/logicalfuncs.c        |   9 +-
 src/backend/replication/slotfuncs.c           |   8 +-
 src/backend/replication/walsender.c           |   8 +-
 src/bin/pg_rewind/parsexlog.c                 |  24 +--
 src/bin/pg_waldump/pg_waldump.c               |  10 +-
 contrib/pg_walinspect/pg_walinspect.c         |  11 +-
 src/tools/pgindent/typedefs.list              |   2 +
 16 files changed, 201 insertions(+), 119 deletions(-)

diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
index 7dd7f20ad0..5563ad1a67 100644
--- a/src/include/access/xlogprefetcher.h
+++ b/src/include/access/xlogprefetcher.h
@@ -48,7 +48,7 @@ extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
 									XLogRecPtr recPtr);
 
 extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
-											char **errmsg);
+											XLogReaderError *errordata);
 
 extern void XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index da32c7db77..06664dc6fb 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -58,6 +58,24 @@ typedef struct WALSegmentContext
 
 typedef struct XLogReaderState XLogReaderState;
 
+/* Values for XLogReaderError.code */
+typedef enum XLogReaderErrorCode
+{
+	XLOG_READER_NO_ERROR = 0,
+	XLOG_READER_OOM,			/* out-of-memory */
+	XLOG_READER_INVALID_DATA,	/* invalid record data */
+} XLogReaderErrorCode;
+
+/* Error status generated by a WAL reader on failure */
+typedef struct XLogReaderError
+{
+	/* Buffer to hold error message */
+	char	   *message;
+	/* Error code when filling *message */
+	XLogReaderErrorCode code;
+} XLogReaderError;
+
+
 /* Function type definitions for various xlogreader interactions */
 typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
@@ -307,9 +325,9 @@ struct XLogReaderState
 	char	   *readRecordBuf;
 	uint32		readRecordBufSize;
 
-	/* Buffer to hold error message */
-	char	   *errormsg_buf;
-	bool		errormsg_deferred;
+	/* Error state data */
+	XLogReaderError errordata;
+	bool		errordata_deferred;
 
 	/*
 	 * Flag to indicate to XLogPageReadCB that it should not block waiting for
@@ -324,7 +342,8 @@ struct XLogReaderState
 static inline bool
 XLogReaderHasQueuedRecordOrError(XLogReaderState *state)
 {
-	return (state->decode_queue_head != NULL) || state->errormsg_deferred;
+	return (state->decode_queue_head != NULL) ||
+		state->errordata_deferred;
 }
 
 /* Get a new XLogReader */
@@ -355,11 +374,11 @@ typedef enum XLogPageReadResult
 
 /* Read the next XLog record. Returns NULL on end-of-WAL or failure */
 extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
-										 char **errormsg);
+										 XLogReaderError *errordata);
 
 /* Consume the next record or error. */
 extern DecodedXLogRecord *XLogNextRecord(XLogReaderState *state,
-										 char **errormsg);
+										 XLogReaderError *errordata);
 
 /* Release the previously returned record, if necessary. */
 extern XLogRecPtr XLogReleasePreviousRecord(XLogReaderState *state);
@@ -399,7 +418,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
 							 DecodedXLogRecord *decoded,
 							 XLogRecord *record,
 							 XLogRecPtr lsn,
-							 char **errormsg);
+							 XLogReaderError *errordata);
 
 /*
  * Macros that provide access to parts of the record most recently returned by
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index c6af8cfd7e..08bd6586ec 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1399,7 +1399,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
 	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
 									XL_ROUTINE(.page_read = &read_local_xlog_page,
@@ -1413,15 +1413,15 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 
 	XLogBeginRead(xlogreader, lsn);
-	record = XLogReadRecord(xlogreader, &errormsg);
+	record = XLogReadRecord(xlogreader, &errordata);
 
 	if (record == NULL)
 	{
-		if (errormsg)
+		if (errordata.message)
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read two-phase state from WAL at %X/%X: %s",
-							LSN_FORMAT_ARGS(lsn), errormsg)));
+							LSN_FORMAT_ARGS(lsn), errordata.message)));
 		else
 			ereport(ERROR,
 					(errcode_for_file_access(),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcbde10529..56dd9f5b64 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -953,7 +953,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		DecodedXLogRecord *decoded;
 		StringInfoData buf;
 		StringInfoData recordBuf;
-		char	   *errormsg = NULL;
+		XLogReaderError errordata = {0};
 		MemoryContext oldCxt;
 
 		oldCxt = MemoryContextSwitchTo(walDebugCxt);
@@ -987,10 +987,10 @@ XLogInsertRecord(XLogRecData *rdata,
 								   decoded,
 								   record,
 								   EndPos,
-								   &errormsg))
+								   &errordata))
 		{
 			appendStringInfo(&buf, "error decoding record: %s",
-							 errormsg ? errormsg : "no error message");
+							 errordata.message ? errordata.message : "no error message");
 		}
 		else
 		{
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 539928cb85..92d691ca49 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -984,7 +984,7 @@ XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher, XLogRecPtr recPtr)
  * tries to initiate I/O for blocks referenced in future WAL records.
  */
 XLogRecord *
-XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
+XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, XLogReaderError *errdata)
 {
 	DecodedXLogRecord *record;
 	XLogRecPtr	replayed_up_to;
@@ -1052,7 +1052,7 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	}
 
 	/* Read the next record. */
-	record = XLogNextRecord(prefetcher->reader, errmsg);
+	record = XLogNextRecord(prefetcher->reader, errdata);
 	if (!record)
 		return NULL;
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a17263df20..fd1413b6d3 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -41,8 +41,10 @@
 #include "common/logging.h"
 #endif
 
-static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
-			pg_attribute_printf(2, 3);
+static void report_invalid_record(XLogReaderState *state,
+								  XLogReaderErrorCode errorcode,
+								  const char *fmt,...)
+			pg_attribute_printf(3, 4);
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
@@ -66,21 +68,23 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 #define DEFAULT_DECODE_BUFFER_SIZE (64 * 1024)
 
 /*
- * Construct a string in state->errormsg_buf explaining what's wrong with
+ * Construct a string in state->errordata.message explaining what's wrong with
  * the current record being read.
  */
 static void
-report_invalid_record(XLogReaderState *state, const char *fmt,...)
+report_invalid_record(XLogReaderState *state, XLogReaderErrorCode errorcode,
+					  const char *fmt,...)
 {
 	va_list		args;
 
 	fmt = _(fmt);
 
 	va_start(args, fmt);
-	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
+	vsnprintf(state->errordata.message, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
 
-	state->errormsg_deferred = true;
+	state->errordata_deferred = true;
+	state->errordata.code = errorcode;
 }
 
 /*
@@ -141,15 +145,16 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* system_identifier initialized to zeroes above */
 	state->private_data = private_data;
 	/* ReadRecPtr, EndRecPtr and readLen initialized to zeroes above */
-	state->errormsg_buf = palloc_extended(MAX_ERRORMSG_LEN + 1,
-										  MCXT_ALLOC_NO_OOM);
-	if (!state->errormsg_buf)
+	state->errordata.message = palloc_extended(MAX_ERRORMSG_LEN + 1,
+											   MCXT_ALLOC_NO_OOM);
+	if (!state->errordata.message)
 	{
 		pfree(state->readBuf);
 		pfree(state);
 		return NULL;
 	}
-	state->errormsg_buf[0] = '\0';
+	state->errordata.message[0] = '\0';
+	state->errordata.code = XLOG_READER_NO_ERROR;
 
 	/*
 	 * Allocate an initial readRecordBuf of minimal size, which can later be
@@ -157,7 +162,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	 */
 	if (!allocate_recordbuf(state, 0))
 	{
-		pfree(state->errormsg_buf);
+		pfree(state->errordata.message);
 		pfree(state->readBuf);
 		pfree(state);
 		return NULL;
@@ -175,7 +180,7 @@ XLogReaderFree(XLogReaderState *state)
 	if (state->decode_buffer && state->free_decode_buffer)
 		pfree(state->decode_buffer);
 
-	pfree(state->errormsg_buf);
+	pfree(state->errordata.message);
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
 	pfree(state->readBuf);
@@ -335,23 +340,27 @@ XLogReleasePreviousRecord(XLogReaderState *state)
  *
  * On success, a record is returned.
  *
- * The returned record (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogNextRecord.
+ * The returned record (or errordata->message) points to an internal buffer
+ * that's valid until the next call to XLogNextRecord.
  */
 DecodedXLogRecord *
-XLogNextRecord(XLogReaderState *state, char **errormsg)
+XLogNextRecord(XLogReaderState *state, XLogReaderError *errordata)
 {
 	/* Release the last record returned by XLogNextRecord(). */
 	XLogReleasePreviousRecord(state);
 
 	if (state->decode_queue_head == NULL)
 	{
-		*errormsg = NULL;
-		if (state->errormsg_deferred)
+		errordata->message = NULL;
+		errordata->code = XLOG_READER_NO_ERROR;
+		if (state->errordata_deferred)
 		{
-			if (state->errormsg_buf[0] != '\0')
-				*errormsg = state->errormsg_buf;
-			state->errormsg_deferred = false;
+			if (state->errordata.message[0] != '\0')
+				errordata->message = state->errordata.message;
+			if (state->errordata.code != XLOG_READER_NO_ERROR)
+				errordata->code = state->errordata.code;
+			state->errordata_deferred = false;
+			state->errordata.code = XLOG_READER_NO_ERROR;
 		}
 
 		/*
@@ -381,7 +390,8 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
 	state->ReadRecPtr = state->record->lsn;
 	state->EndRecPtr = state->record->next_lsn;
 
-	*errormsg = NULL;
+	errordata->message = NULL;
+	errordata->code = XLOG_READER_NO_ERROR;
 
 	return state->record;
 }
@@ -393,17 +403,17 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
  * to XLogReadRecord().
  *
  * If the page_read callback fails to read the requested data, NULL is
- * returned.  The callback is expected to have reported the error; errormsg
- * is set to NULL.
+ * returned.  The callback is expected to have reported the error;
+ * errordata->message is set to NULL.
  *
  * If the reading fails for some other reason, NULL is also returned, and
- * *errormsg is set to a string with details of the failure.
+ * *errordata is set with details of the failure.
  *
- * The returned pointer (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogReadRecord.
+ * The returned pointer (or errordata->message) points to an internal
+ * buffer that's valid until the next call to XLogReadRecord.
  */
 XLogRecord *
-XLogReadRecord(XLogReaderState *state, char **errormsg)
+XLogReadRecord(XLogReaderState *state, XLogReaderError *errordata)
 {
 	DecodedXLogRecord *decoded;
 
@@ -421,7 +431,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		XLogReadAhead(state, false /* nonblocking */ );
 
 	/* Consume the head record or error. */
-	decoded = XLogNextRecord(state, errormsg);
+	decoded = XLogNextRecord(state, errordata);
 	if (decoded)
 	{
 		/*
@@ -530,7 +540,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	bool		gotheader;
 	int			readOff;
 	DecodedXLogRecord *decoded;
-	char	   *errormsg;		/* not used */
+	XLogReaderError errordata = {0};	/* not used */
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -540,7 +550,8 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	randAccess = false;
 
 	/* reset error state */
-	state->errormsg_buf[0] = '\0';
+	state->errordata.message[0] = '\0';
+	state->errordata.code = XLOG_READER_NO_ERROR;
 	decoded = NULL;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -607,7 +618,9 @@ restart:
 	}
 	else if (targetRecOff < pageHeaderSize)
 	{
-		report_invalid_record(state, "invalid record offset at %X/%X: expected at least %u, got %u",
+		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
+							  "invalid record offset at %X/%X: expected at least %u, got %u",
 							  LSN_FORMAT_ARGS(RecPtr),
 							  pageHeaderSize, targetRecOff);
 		goto err;
@@ -616,7 +629,9 @@ restart:
 	if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
 		targetRecOff == pageHeaderSize)
 	{
-		report_invalid_record(state, "contrecord is requested by %X/%X",
+		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
+							  "contrecord is requested by %X/%X",
 							  LSN_FORMAT_ARGS(RecPtr));
 		goto err;
 	}
@@ -657,6 +672,7 @@ restart:
 		if (total_len < SizeOfXLogRecord)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "invalid record length at %X/%X: expected at least %u, got %u",
 								  LSN_FORMAT_ARGS(RecPtr),
 								  (uint32) SizeOfXLogRecord, total_len);
@@ -746,6 +762,7 @@ restart:
 			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "there is no contrecord flag at %X/%X",
 									  LSN_FORMAT_ARGS(RecPtr));
 				goto err;
@@ -759,6 +776,7 @@ restart:
 				total_len != (pageHeader->xlp_rem_len + gotlen))
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "invalid contrecord length %u (expected %lld) at %X/%X",
 									  pageHeader->xlp_rem_len,
 									  ((long long) total_len) - gotlen,
@@ -817,8 +835,10 @@ restart:
 				memcpy(save_copy, state->readRecordBuf, gotlen);
 				if (!allocate_recordbuf(state, total_len))
 				{
-					/* We treat this as a "bogus data" condition */
-					report_invalid_record(state, "record length %u at %X/%X too long",
+					/* We treat this as an out-of-memory error */
+					report_invalid_record(state,
+										  XLOG_READER_OOM,
+										  "record length %u at %X/%X too long",
 										  total_len, LSN_FORMAT_ARGS(RecPtr));
 					goto err;
 				}
@@ -881,15 +901,16 @@ restart:
 		{
 			/*
 			 * We failed to allocate memory for an oversized record.  As
-			 * above, we currently treat this as a "bogus data" condition.
+			 * above, we currently treat this as an out-of-memory error.
 			 */
 			report_invalid_record(state,
+								  XLOG_READER_OOM,
 								  "out of memory while trying to decode a record of length %u", total_len);
 			goto err;
 		}
 	}
 
-	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
+	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errordata))
 	{
 		/* Record the location of the next record. */
 		decoded->next_lsn = state->NextRecPtr;
@@ -938,7 +959,7 @@ err:
 		 * queued so that XLogPrefetcherReadRecord() doesn't bring us back a
 		 * second time and clobber the above state.
 		 */
-		state->errormsg_deferred = true;
+		state->errordata_deferred = true;
 	}
 
 	if (decoded && decoded->oversized)
@@ -951,9 +972,9 @@ err:
 	XLogReaderInvalReadState(state);
 
 	/*
-	 * If an error was written to errmsg_buf, it'll be returned to the caller
-	 * of XLogReadRecord() after all successfully decoded records from the
-	 * read queue.
+	 * If an error was written to errordata.message, it'll be returned to the
+	 * caller of XLogReadRecord() after all successfully decoded records from
+	 * the read queue.
 	 */
 
 	return XLREAD_FAIL;
@@ -972,7 +993,7 @@ XLogReadAhead(XLogReaderState *state, bool nonblocking)
 {
 	XLogPageReadResult result;
 
-	if (state->errormsg_deferred)
+	if (state->errordata_deferred)
 		return NULL;
 
 	result = XLogDecodeNextRecord(state, nonblocking);
@@ -990,8 +1011,8 @@ XLogReadAhead(XLogReaderState *state, bool nonblocking)
  * via the page_read() callback.
  *
  * Returns XLREAD_FAIL if the required page cannot be read for some
- * reason; errormsg_buf is set in that case (unless the error occurs in the
- * page_read callback).
+ * reason; errordata.message is set in that case (unless the error occurs in
+ * the page_read callback).
  *
  * Returns XLREAD_WOULDBLOCK if the requested data can't be read without
  * waiting.  This can be returned only if the installed page_read callback
@@ -1136,6 +1157,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid record length at %X/%X: expected at least %u, got %u",
 							  LSN_FORMAT_ARGS(RecPtr),
 							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
@@ -1144,6 +1166,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid resource manager ID %u at %X/%X",
 							  record->xl_rmid, LSN_FORMAT_ARGS(RecPtr));
 		return false;
@@ -1157,6 +1180,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 		if (!(record->xl_prev < RecPtr))
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "record with incorrect prev-link %X/%X at %X/%X",
 								  LSN_FORMAT_ARGS(record->xl_prev),
 								  LSN_FORMAT_ARGS(RecPtr));
@@ -1173,6 +1197,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 		if (record->xl_prev != PrevRecPtr)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "record with incorrect prev-link %X/%X at %X/%X",
 								  LSN_FORMAT_ARGS(record->xl_prev),
 								  LSN_FORMAT_ARGS(RecPtr));
@@ -1211,6 +1236,7 @@ ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
 	if (!EQ_CRC32C(record->xl_crc, crc))
 	{
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "incorrect resource manager data checksum in record at %X/%X",
 							  LSN_FORMAT_ARGS(recptr));
 		return false;
@@ -1245,6 +1271,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid magic number %04X in WAL segment %s, LSN %X/%X, offset %u",
 							  hdr->xlp_magic,
 							  fname,
@@ -1260,6 +1287,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 							  hdr->xlp_info,
 							  fname,
@@ -1276,6 +1304,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 			longhdr->xlp_sysid != state->system_identifier)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "WAL file is from different database system: WAL file database system identifier is %llu, pg_control database system identifier is %llu",
 								  (unsigned long long) longhdr->xlp_sysid,
 								  (unsigned long long) state->system_identifier);
@@ -1284,12 +1313,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		else if (longhdr->xlp_seg_size != state->segcxt.ws_segsize)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "WAL file is from different database system: incorrect segment size in page header");
 			return false;
 		}
 		else if (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "WAL file is from different database system: incorrect XLOG_BLCKSZ in page header");
 			return false;
 		}
@@ -1302,6 +1333,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 		/* hmm, first page of file doesn't have a long header? */
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 							  hdr->xlp_info,
 							  fname,
@@ -1322,6 +1354,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "unexpected pageaddr %X/%X in WAL segment %s, LSN %X/%X, offset %u",
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
@@ -1348,6 +1381,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 			XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "out-of-sequence timeline ID %u (after %u) in WAL segment %s, LSN %X/%X, offset %u",
 								  hdr->xlp_tli,
 								  state->latestPageTLI,
@@ -1369,8 +1403,9 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 void
 XLogReaderResetError(XLogReaderState *state)
 {
-	state->errormsg_buf[0] = '\0';
-	state->errormsg_deferred = false;
+	state->errordata.message[0] = '\0';
+	state->errordata_deferred = false;
+	state->errordata.code = XLOG_READER_NO_ERROR;
 }
 
 /*
@@ -1390,7 +1425,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	XLogRecPtr	tmpRecPtr;
 	XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
 
@@ -1475,7 +1510,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	 * or we just jumped over the remaining data of a continuation.
 	 */
 	XLogBeginRead(state, tmpRecPtr);
-	while (XLogReadRecord(state, &errormsg) != NULL)
+	while (XLogReadRecord(state, &errordata) != NULL)
 	{
 		/* past the record we've found, break out */
 		if (RecPtr <= state->ReadRecPtr)
@@ -1620,8 +1655,9 @@ ResetDecoder(XLogReaderState *state)
 	state->decode_buffer_head = state->decode_buffer;
 
 	/* Clear error state. */
-	state->errormsg_buf[0] = '\0';
-	state->errormsg_deferred = false;
+	state->errordata.message[0] = '\0';
+	state->errordata_deferred = false;
+	state->errordata.code = XLOG_READER_NO_ERROR;
 }
 
 /*
@@ -1671,7 +1707,7 @@ DecodeXLogRecord(XLogReaderState *state,
 				 DecodedXLogRecord *decoded,
 				 XLogRecord *record,
 				 XLogRecPtr lsn,
-				 char **errormsg)
+				 XLogReaderError *errordata)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1754,6 +1790,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "out-of-order block_id %u at %X/%X",
 									  block_id,
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1778,6 +1815,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			if (blk->has_data && blk->data_len == 0)
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
 				goto err;
@@ -1785,6 +1823,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			if (!blk->has_data && blk->data_len != 0)
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "BKPBLOCK_HAS_DATA not set, but data length is %u at %X/%X",
 									  (unsigned int) blk->data_len,
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1821,6 +1860,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					 blk->bimg_len == BLCKSZ))
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPIMAGE_HAS_HOLE set, but hole offset %u length %u block image length %u at %X/%X",
 										  (unsigned int) blk->hole_offset,
 										  (unsigned int) blk->hole_length,
@@ -1837,6 +1877,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					(blk->hole_offset != 0 || blk->hole_length != 0))
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPIMAGE_HAS_HOLE not set, but hole offset %u length %u at %X/%X",
 										  (unsigned int) blk->hole_offset,
 										  (unsigned int) blk->hole_length,
@@ -1851,6 +1892,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					blk->bimg_len == BLCKSZ)
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPIMAGE_COMPRESSED set, but block image length %u at %X/%X",
 										  (unsigned int) blk->bimg_len,
 										  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1866,6 +1908,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					blk->bimg_len != BLCKSZ)
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "neither BKPIMAGE_HAS_HOLE nor BKPIMAGE_COMPRESSED set, but block image length is %u at %X/%X",
 										  (unsigned int) blk->data_len,
 										  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1882,6 +1925,7 @@ DecodeXLogRecord(XLogReaderState *state,
 				if (rlocator == NULL)
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPBLOCK_SAME_REL set but no previous rel at %X/%X",
 										  LSN_FORMAT_ARGS(state->ReadRecPtr));
 					goto err;
@@ -1894,6 +1938,7 @@ DecodeXLogRecord(XLogReaderState *state,
 		else
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "invalid block_id %u at %X/%X",
 								  block_id, LSN_FORMAT_ARGS(state->ReadRecPtr));
 			goto err;
@@ -1961,10 +2006,12 @@ DecodeXLogRecord(XLogReaderState *state,
 
 shortdata_err:
 	report_invalid_record(state,
+						  XLOG_READER_INVALID_DATA,
 						  "record with invalid length at %X/%X",
 						  LSN_FORMAT_ARGS(state->ReadRecPtr));
 err:
-	*errormsg = state->errormsg_buf;
+	errordata->message = state->errordata.message;
+	errordata->code = state->errordata.code;
 
 	return false;
 }
@@ -2071,6 +2118,7 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 		!record->record->blocks[block_id].in_use)
 	{
 		report_invalid_record(record,
+							  XLOG_READER_INVALID_DATA,
 							  "could not restore image at %X/%X with invalid block %d specified",
 							  LSN_FORMAT_ARGS(record->ReadRecPtr),
 							  block_id);
@@ -2078,7 +2126,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	}
 	if (!record->record->blocks[block_id].has_image)
 	{
-		report_invalid_record(record, "could not restore image at %X/%X with invalid state, block %d",
+		report_invalid_record(record,
+							  XLOG_READER_INVALID_DATA,
+							  "could not restore image at %X/%X with invalid state, block %d",
 							  LSN_FORMAT_ARGS(record->ReadRecPtr),
 							  block_id);
 		return false;
@@ -2105,7 +2155,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 									bkpb->bimg_len, BLCKSZ - bkpb->hole_length) <= 0)
 				decomp_success = false;
 #else
-			report_invalid_record(record, "could not restore image at %X/%X compressed with %s not supported by build, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not restore image at %X/%X compressed with %s not supported by build, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  "LZ4",
 								  block_id);
@@ -2122,7 +2174,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 			if (ZSTD_isError(decomp_result))
 				decomp_success = false;
 #else
-			report_invalid_record(record, "could not restore image at %X/%X compressed with %s not supported by build, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not restore image at %X/%X compressed with %s not supported by build, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  "zstd",
 								  block_id);
@@ -2131,7 +2185,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 		}
 		else
 		{
-			report_invalid_record(record, "could not restore image at %X/%X compressed with unknown method, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not restore image at %X/%X compressed with unknown method, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  block_id);
 			return false;
@@ -2139,7 +2195,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 
 		if (!decomp_success)
 		{
-			report_invalid_record(record, "could not decompress image at %X/%X, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not decompress image at %X/%X, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  block_id);
 			return false;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..68100bfa4a 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2454,7 +2454,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
 		if (!RestoreBlockImage(record, block_id, primary_image_masked))
 			ereport(ERROR,
 					(errcode(ERRCODE_INTERNAL_ERROR),
-					 errmsg_internal("%s", record->errormsg_buf)));
+					 errmsg_internal("%s", record->errordata.message)));
 
 		/*
 		 * If masking function is defined, mask both the primary and replay
@@ -3062,9 +3062,9 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 	for (;;)
 	{
-		char	   *errormsg;
+		XLogReaderError errordata = {0};
 
-		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
+		record = XLogPrefetcherReadRecord(xlogprefetcher, &errordata);
 		if (record == NULL)
 		{
 			/*
@@ -3098,9 +3098,9 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * StandbyMode that only happens if we have been triggered, so we
 			 * shouldn't loop anymore in that case.
 			 */
-			if (errormsg)
+			if (errordata.message)
 				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+						(errmsg_internal("%s", errordata.message) /* already translated */ ));
 		}
 
 		/*
@@ -3385,9 +3385,9 @@ retry:
 		 * Emit this error right now then retry this page immediately. Use
 		 * errmsg_internal() because the message was already translated.
 		 */
-		if (xlogreader->errormsg_buf[0])
+		if (xlogreader->errordata.message[0])
 			ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-					(errmsg_internal("%s", xlogreader->errormsg_buf)));
+					(errmsg_internal("%s", xlogreader->errordata.message)));
 
 		/* reset any error XLogReaderValidatePageHeader() might have set */
 		XLogReaderResetError(xlogreader);
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 43f7b31205..a50fc9cb97 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -395,7 +395,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 		if (!RestoreBlockImage(record, block_id, page))
 			ereport(ERROR,
 					(errcode(ERRCODE_INTERNAL_ERROR),
-					 errmsg_internal("%s", record->errormsg_buf)));
+					 errmsg_internal("%s", record->errordata.message)));
 
 		/*
 		 * The page may be uninitialized. If so, we can't set the LSN because
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 41243d0187..f48feab944 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -641,12 +641,13 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 	for (;;)
 	{
 		XLogRecord *record;
-		char	   *err = NULL;
+		XLogReaderError errordata = {0};
 
 		/* the read_page callback waits for new WAL */
-		record = XLogReadRecord(ctx->reader, &err);
-		if (err)
-			elog(ERROR, "could not find logical decoding starting point: %s", err);
+		record = XLogReadRecord(ctx->reader, &errordata);
+		if (errordata.message)
+			elog(ERROR, "could not find logical decoding starting point: %s",
+				 errordata.message);
 		if (!record)
 			elog(ERROR, "could not find logical decoding starting point");
 
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 197169d6b0..ca372e5f66 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -244,11 +244,12 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		while (ctx->reader->EndRecPtr < end_of_wal)
 		{
 			XLogRecord *record;
-			char	   *errm = NULL;
+			XLogReaderError errordata = {0};
 
-			record = XLogReadRecord(ctx->reader, &errm);
-			if (errm)
-				elog(ERROR, "could not find record for logical decoding: %s", errm);
+			record = XLogReadRecord(ctx->reader, &errordata);
+			if (errordata.message)
+				elog(ERROR, "could not find record for logical decoding: %s",
+					 errordata.message);
 
 			/*
 			 * The {begin_txn,change,commit_txn}_wrapper callbacks above will
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6035cf4816..4fa4e6bfed 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -503,17 +503,17 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 		/* Decode at least one record, until we run out of records */
 		while (ctx->reader->EndRecPtr < moveto)
 		{
-			char	   *errm = NULL;
 			XLogRecord *record;
+			XLogReaderError errordata = {0};
 
 			/*
 			 * Read records.  No changes are generated in fast_forward mode,
 			 * but snapbuilder/slot statuses are updated properly.
 			 */
-			record = XLogReadRecord(ctx->reader, &errm);
-			if (errm)
+			record = XLogReadRecord(ctx->reader, &errordata);
+			if (errordata.message)
 				elog(ERROR, "could not find record while advancing replication slot: %s",
-					 errm);
+					 errordata.message);
 
 			/*
 			 * Process the record.  Storage-level changes are ignored in
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..55109bfa51 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3045,7 +3045,7 @@ static void
 XLogSendLogical(void)
 {
 	XLogRecord *record;
-	char	   *errm;
+	XLogReaderError errordata = {0};
 
 	/*
 	 * We'll use the current flush point to determine whether we've caught up.
@@ -3063,12 +3063,12 @@ XLogSendLogical(void)
 	 */
 	WalSndCaughtUp = false;
 
-	record = XLogReadRecord(logical_decoding_ctx->reader, &errm);
+	record = XLogReadRecord(logical_decoding_ctx->reader, &errordata);
 
 	/* xlog record was invalid */
-	if (errm != NULL)
+	if (errordata.message != NULL)
 		elog(ERROR, "could not find record while sending logically-decoded data: %s",
-			 errm);
+			 errordata.message);
 
 	if (record != NULL)
 	{
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 27782237d0..2705d9bf45 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -68,7 +68,7 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 	XLogPageReadPrivate private;
 
 	private.tliIndex = tliIndex;
@@ -82,16 +82,16 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
 	XLogBeginRead(xlogreader, startpoint);
 	do
 	{
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogReadRecord(xlogreader, &errordata);
 
 		if (record == NULL)
 		{
 			XLogRecPtr	errptr = xlogreader->EndRecPtr;
 
-			if (errormsg)
+			if (errordata.message)
 				pg_fatal("could not read WAL record at %X/%X: %s",
 						 LSN_FORMAT_ARGS(errptr),
-						 errormsg);
+						 errordata.message);
 			else
 				pg_fatal("could not read WAL record at %X/%X",
 						 LSN_FORMAT_ARGS(errptr));
@@ -126,7 +126,7 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex,
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 	XLogPageReadPrivate private;
 	XLogRecPtr	endptr;
 
@@ -139,12 +139,12 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex,
 		pg_fatal("out of memory while allocating a WAL reading processor");
 
 	XLogBeginRead(xlogreader, ptr);
-	record = XLogReadRecord(xlogreader, &errormsg);
+	record = XLogReadRecord(xlogreader, &errordata);
 	if (record == NULL)
 	{
-		if (errormsg)
+		if (errordata.message)
 			pg_fatal("could not read WAL record at %X/%X: %s",
-					 LSN_FORMAT_ARGS(ptr), errormsg);
+					 LSN_FORMAT_ARGS(ptr), errordata.message);
 		else
 			pg_fatal("could not read WAL record at %X/%X",
 					 LSN_FORMAT_ARGS(ptr));
@@ -173,7 +173,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 	XLogRecord *record;
 	XLogRecPtr	searchptr;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 	XLogPageReadPrivate private;
 
 	/*
@@ -204,14 +204,14 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 		uint8		info;
 
 		XLogBeginRead(xlogreader, searchptr);
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogReadRecord(xlogreader, &errordata);
 
 		if (record == NULL)
 		{
-			if (errormsg)
+			if (errordata.message)
 				pg_fatal("could not find previous WAL record at %X/%X: %s",
 						 LSN_FORMAT_ARGS(searchptr),
-						 errormsg);
+						 errordata.message);
 			else
 				pg_fatal("could not find previous WAL record at %X/%X",
 						 LSN_FORMAT_ARGS(searchptr));
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a3535bdfa9..880c93b51b 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -512,7 +512,7 @@ XLogRecordSaveFPWs(XLogReaderState *record, const char *savepath)
 
 		/* Full page exists, so let's save it */
 		if (!RestoreBlockImage(record, block_id, page))
-			pg_fatal("%s", record->errormsg_buf);
+			pg_fatal("%s", record->errordata.message);
 
 		(void) XLogRecGetBlockTagExtended(record, block_id,
 										  &rnode, &fork, &blk, NULL);
@@ -800,7 +800,7 @@ main(int argc, char **argv)
 	XLogRecord *record;
 	XLogRecPtr	first_record;
 	char	   *waldir = NULL;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
 	static struct option long_options[] = {
 		{"bkp-details", no_argument, NULL, 'b'},
@@ -1243,7 +1243,7 @@ main(int argc, char **argv)
 		}
 
 		/* try to read the next record */
-		record = XLogReadRecord(xlogreader_state, &errormsg);
+		record = XLogReadRecord(xlogreader_state, &errordata);
 		if (!record)
 		{
 			if (!config.follow || private.endptr_reached)
@@ -1308,10 +1308,10 @@ main(int argc, char **argv)
 	if (time_to_stop)
 		exit(0);
 
-	if (errormsg)
+	if (errordata.message)
 		pg_fatal("error in WAL record at %X/%X: %s",
 				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+				 errordata.message);
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index 796a74f322..e7d30554ed 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -146,9 +146,9 @@ static XLogRecord *
 ReadNextXLogRecord(XLogReaderState *xlogreader)
 {
 	XLogRecord *record;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
-	record = XLogReadRecord(xlogreader, &errormsg);
+	record = XLogReadRecord(xlogreader, &errordata);
 
 	if (record == NULL)
 	{
@@ -161,11 +161,12 @@ ReadNextXLogRecord(XLogReaderState *xlogreader)
 		if (private_data->end_of_wal)
 			return NULL;
 
-		if (errormsg)
+		if (errordata.message)
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read WAL at %X/%X: %s",
-							LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+							LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+							errordata.message)));
 		else
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -384,7 +385,7 @@ GetWALBlockInfo(FunctionCallInfo fcinfo, XLogReaderState *record,
 			if (!RestoreBlockImage(record, block_id, page))
 				ereport(ERROR,
 						(errcode(ERRCODE_INTERNAL_ERROR),
-						 errmsg_internal("%s", record->errormsg_buf)));
+						 errmsg_internal("%s", record->errordata.message)));
 
 			block_fpi_data = (bytea *) palloc(BLCKSZ + VARHDRSZ);
 			SET_VARSIZE(block_fpi_data, BLCKSZ + VARHDRSZ);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b5bbdd1608..35cb4b82f4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3080,6 +3080,8 @@ XLogPageReadResult
 XLogPrefetchStats
 XLogPrefetcher
 XLogPrefetcherFilter
+XLogReaderError
+XLogReaderErrorCode
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.40.1

v4-0002-Make-WAL-replay-more-robust-on-OOM-failures.patch (text/x-diff; charset=us-ascii)
From 619c8bc3c2bd3a2cf24a283dcfc9666b9769b50c Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 26 Sep 2023 15:23:37 +0900
Subject: [PATCH v4 2/3] Make WAL replay more robust on OOM failures

This takes advantage of the new error facility for WAL readers, allowing
WAL replay to loop when an out-of-memory failure happens while reading a
record.  Such failures were the origin of potential data loss scenarios.
Crash recovery becomes more robust by acting the same way as a standby
here: each time a record cannot be read because of an OOM, loop and try
to read the record again.
---
 src/backend/access/transam/xlogrecovery.c | 75 ++++++++++++++++-------
 1 file changed, 52 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 68100bfa4a..ed5ac06938 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3067,29 +3067,50 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errordata);
 		if (record == NULL)
 		{
-			/*
-			 * When we find that WAL ends in an incomplete record, keep track
-			 * of that record.  After recovery is done, we'll write a record
-			 * to indicate to downstream WAL readers that that portion is to
-			 * be ignored.
-			 *
-			 * However, when ArchiveRecoveryRequested = true, we're going to
-			 * switch to a new timeline at the end of recovery. We will only
-			 * copy WAL over to the new timeline up to the end of the last
-			 * complete record, so if we did this, we would later create an
-			 * overwrite contrecord in the wrong place, breaking everything.
-			 */
-			if (!ArchiveRecoveryRequested &&
-				!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
+			switch (errordata.code)
 			{
-				abortedRecPtr = xlogreader->abortedRecPtr;
-				missingContrecPtr = xlogreader->missingContrecPtr;
-			}
+				case XLOG_READER_NO_ERROR:
+					/* Possible when XLogPageRead() has failed */
+					Assert(!errordata.message);
+					/* FALLTHROUGH */
 
-			if (readFile >= 0)
-			{
-				close(readFile);
-				readFile = -1;
+				case XLOG_READER_INVALID_DATA:
+
+					/*
+					 * When we find that WAL ends in an incomplete record,
+					 * keep track of that record.  After recovery is done,
+					 * we'll write a record to indicate to downstream WAL
+					 * readers that that portion is to be ignored.
+					 *
+					 * However, when ArchiveRecoveryRequested = true, we're
+					 * going to switch to a new timeline at the end of
+					 * recovery. We will only copy WAL over to the new
+					 * timeline up to the end of the last complete record, so
+					 * if we did this, we would later create an overwrite
+					 * contrecord in the wrong place, breaking everything.
+					 */
+					if (!ArchiveRecoveryRequested &&
+						!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
+					{
+						abortedRecPtr = xlogreader->abortedRecPtr;
+						missingContrecPtr = xlogreader->missingContrecPtr;
+					}
+
+					if (readFile >= 0)
+					{
+						close(readFile);
+						readFile = -1;
+					}
+					break;
+				case XLOG_READER_OOM:
+
+					/*
+					 * If we failed because of an out-of-memory problem, just
+					 * give up and retry recovery later.  It may be possible
+					 * that the WAL record to decode required a larger memory
+					 * allocation than what the host can offer.
+					 */
+					break;
 			}
 
 			/*
@@ -3147,9 +3168,12 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * WAL from the archive, even if pg_wal is completely empty, but
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
+			 *
+			 * It may be possible that the record was not decoded because of
+			 * an out-of-memory failure.  In this case, just loop.
 			 */
 			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+				!fetching_ckpt && errordata.code != XLOG_READER_OOM)
 			{
 				ereport(DEBUG1,
 						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
@@ -3173,9 +3197,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
+			/*
+			 * In standby mode or if the WAL record failed on an
+			 * out-of-memory, loop back and retry.  Otherwise, give up.
+			 */
 			if (StandbyMode && !CheckForStandbyTrigger())
 				continue;
+			else if (errordata.code == XLOG_READER_OOM)
+				continue;
 			else
 				return NULL;
 		}
-- 
2.40.1
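
The retry policy that v4-0002 adds to ReadRecord() can be sketched in isolation as below. This is only an illustration, not code from the patch: ReplayAction and replay_action_on_failure() are hypothetical names, the enum is copied so the snippet compiles outside the PostgreSQL tree, and standby_waiting stands for "StandbyMode && !CheckForStandbyTrigger()". The switch to archive recovery that sits between the two checks in the real function is left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Copy of the enum proposed in the patch set, repeated here so that the
 * sketch compiles on its own.
 */
typedef enum XLogReaderErrorCode
{
	XLOG_READER_NO_ERROR = 0,
	XLOG_READER_OOM,
	XLOG_READER_INVALID_DATA
} XLogReaderErrorCode;

/* Hypothetical type: what the replay loop does after a failed read. */
typedef enum ReplayAction
{
	REPLAY_RETRY,
	REPLAY_STOP
} ReplayAction;

/*
 * Hypothetical helper mirroring the tail of the "record == NULL" branch.
 */
static ReplayAction
replay_action_on_failure(XLogReaderErrorCode code, bool standby_waiting)
{
	/* A standby that has not been triggered keeps looping. */
	if (standby_waiting)
		return REPLAY_RETRY;

	/* An OOM says nothing about where WAL ends, so read the record again. */
	if (code == XLOG_READER_OOM)
		return REPLAY_RETRY;

	/* Only "no error" and "invalid data" are left: treat as end of WAL. */
	assert(code == XLOG_READER_NO_ERROR || code == XLOG_READER_INVALID_DATA);
	return REPLAY_STOP;
}
```

The point of the separation is that only XLOG_READER_INVALID_DATA (or a page read failure with no message) is allowed to end recovery; an OOM never is.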

v4-0003-Tweak-to-force-OOM-behavior-when-replaying-record.patch (text/x-diff; charset=us-ascii)
From 146d50748f8a0c308ac39ea20b88a3b289c8d269 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 26 Sep 2023 15:23:50 +0900
Subject: [PATCH v4 3/3] Tweak to force OOM behavior when replaying records

---
 src/backend/access/transam/xlogreader.c | 31 +++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fd1413b6d3..854f584e30 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -541,6 +541,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	int			readOff;
 	DecodedXLogRecord *decoded;
 	XLogReaderError errordata = {0};	/* not used */
+	bool        trigger_oom = false;
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -690,7 +691,29 @@ restart:
 	decoded = XLogReadRecordAlloc(state,
 								  total_len,
 								  false /* allow_oversized */ );
-	if (decoded == NULL && nonblocking)
+
+#ifndef FRONTEND
+	/*
+	 * Trick to emulate an OOM after a hardcoded number of records
+	 * replayed.
+	 */
+	{
+		struct stat fstat;
+		static int counter = 0;
+		if (stat("/tmp/xlogreader_oom", &fstat) == 0)
+		{
+			counter++;
+			if (counter >= 100)
+			{
+				trigger_oom = true;
+				/* Reset counter, to not fail when shutting down WAL */
+				counter = 0;
+			}
+		}
+	}
+#endif
+
+	if ((decoded == NULL || trigger_oom) && nonblocking)
 	{
 		/*
 		 * There is no space in the circular decode buffer, and the caller is
@@ -833,7 +856,7 @@ restart:
 				Assert(gotlen <= lengthof(save_copy));
 				Assert(gotlen <= state->readRecordBufSize);
 				memcpy(save_copy, state->readRecordBuf, gotlen);
-				if (!allocate_recordbuf(state, total_len))
+				if (!allocate_recordbuf(state, total_len) || trigger_oom)
 				{
 					/* We treat this as an out-of-memory error */
 					report_invalid_record(state,
@@ -891,13 +914,13 @@ restart:
 	 * If we got here without a DecodedXLogRecord, it means we needed to
 	 * validate total_len before trusting it, but by now now we've done that.
 	 */
-	if (decoded == NULL)
+	if (decoded == NULL || trigger_oom)
 	{
 		Assert(!nonblocking);
 		decoded = XLogReadRecordAlloc(state,
 									  total_len,
 									  true /* allow_oversized */ );
-		if (decoded == NULL)
+		if (decoded == NULL || trigger_oom)
 		{
 			/*
 			 * We failed to allocate memory for an oversized record.  As
-- 
2.40.1
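
The fault-injection trick in v4-0003 (count calls while a flag file exists, fail on every Nth one, then reset) can be shown on its own. A minimal sketch, assuming a POSIX stat(); should_inject_failure() and its parameters are hypothetical names, whereas the patch hardcodes /tmp/xlogreader_oom and a threshold of 100.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of the patch: returns true once every
 * "threshold" calls, but only while "flag_path" exists.
 */
static bool
should_inject_failure(const char *flag_path, int threshold)
{
	static int	counter = 0;
	struct stat st;

	assert(threshold > 0);

	/* No flag file: injection is disabled and the counter is untouched. */
	if (stat(flag_path, &st) != 0)
		return false;

	if (++counter < threshold)
		return false;

	/* Reset the counter so that the calls that follow succeed again. */
	counter = 0;
	return true;
}
```

Resetting the counter matters for the same reason given in the patch comment: without it every later read would fail too, including the ones done at shutdown.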

xlogreader_oom.bash (text/plain; charset=us-ascii)
#23 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#22)
4 attachment(s)
Re: Incorrect handling of OOM in WAL replay leading to data loss

On Tue, Sep 26, 2023 at 03:48:07PM +0900, Michael Paquier wrote:

By the way, anything that I am proposing here cannot be backpatched
because of the infrastructure changes required in walreader.c, so I am
going to create a second thread with something that could be
backpatched (yeah, likely FATALs on OOM to stop recovery from doing
something bad).

The patch set is rebased following 6b18b3fe2c2f, which switched OOMs in
xlogreader.c to fail hard. It has nothing new, except that 0001 is now
a revert of 6b18b3fe2c2f, switching xlogreader.c back to soft errors on
OOMs.

If there's no interest in this patch set after the next CF, I'm OK to
drop it. The state of HEAD is at least correct in the OOM cases now.
--
Michael

Attachments:

v5-0001-Revert-Fail-hard-on-out-of-memory-failures-in-xlo.patch (text/x-diff; charset=us-ascii)
From aa5377d221371f6be8729a27f1df18aa9c4a48e2 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 3 Oct 2023 16:12:14 +0900
Subject: [PATCH v5 1/4] Revert "Fail hard on out-of-memory failures in
 xlogreader.c"

This reverts commit 6b18b3fe2c2, putting back the code of xlogreader.c
to handle OOMs as soft failures.
---
 src/backend/access/transam/xlogreader.c | 47 ++++++++++++++++++++-----
 1 file changed, 39 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a1363e3b8f..a17263df20 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -43,7 +43,7 @@
 
 static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
 			pg_attribute_printf(2, 3);
-static void allocate_recordbuf(XLogReaderState *state, uint32 reclength);
+static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
@@ -155,7 +155,14 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	 * Allocate an initial readRecordBuf of minimal size, which can later be
 	 * enlarged if necessary.
 	 */
-	allocate_recordbuf(state, 0);
+	if (!allocate_recordbuf(state, 0))
+	{
+		pfree(state->errormsg_buf);
+		pfree(state->readBuf);
+		pfree(state);
+		return NULL;
+	}
+
 	return state;
 }
 
@@ -177,6 +184,7 @@ XLogReaderFree(XLogReaderState *state)
 
 /*
  * Allocate readRecordBuf to fit a record of at least the given length.
+ * Returns true if successful, false if out of memory.
  *
  * readRecordBufSize is set to the new buffer size.
  *
@@ -188,7 +196,7 @@ XLogReaderFree(XLogReaderState *state)
  * Note: This routine should *never* be called for xl_tot_len until the header
  * of the record has been fully validated.
  */
-static void
+static bool
 allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 {
 	uint32		newSize = reclength;
@@ -198,8 +206,15 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
-	state->readRecordBuf = (char *) palloc(newSize);
+	state->readRecordBuf =
+		(char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
+	if (state->readRecordBuf == NULL)
+	{
+		state->readRecordBufSize = 0;
+		return false;
+	}
 	state->readRecordBufSize = newSize;
+	return true;
 }
 
 /*
@@ -490,7 +505,9 @@ XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversi
 	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
 	if (allow_oversized)
 	{
-		decoded = palloc(required_space);
+		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
+		if (decoded == NULL)
+			return NULL;
 		decoded->oversized = true;
 		return decoded;
 	}
@@ -798,7 +815,13 @@ restart:
 				Assert(gotlen <= lengthof(save_copy));
 				Assert(gotlen <= state->readRecordBufSize);
 				memcpy(save_copy, state->readRecordBuf, gotlen);
-				allocate_recordbuf(state, total_len);
+				if (!allocate_recordbuf(state, total_len))
+				{
+					/* We treat this as a "bogus data" condition */
+					report_invalid_record(state, "record length %u at %X/%X too long",
+										  total_len, LSN_FORMAT_ARGS(RecPtr));
+					goto err;
+				}
 				memcpy(state->readRecordBuf, save_copy, gotlen);
 				buffer = state->readRecordBuf + gotlen;
 			}
@@ -854,8 +877,16 @@ restart:
 		decoded = XLogReadRecordAlloc(state,
 									  total_len,
 									  true /* allow_oversized */ );
-		/* allocation should always happen under allow_oversized */
-		Assert(decoded != NULL);
+		if (decoded == NULL)
+		{
+			/*
+			 * We failed to allocate memory for an oversized record.  As
+			 * above, we currently treat this as a "bogus data" condition.
+			 */
+			report_invalid_record(state,
+								  "out of memory while trying to decode a record of length %u", total_len);
+			goto err;
+		}
 	}
 
 	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
-- 
2.42.0
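
What the revert restores is the "soft failure" allocation pattern: the allocator returns NULL, and the helper reports that to its caller instead of erroring out. A rough standalone sketch follows, with malloc() standing in for palloc_extended(..., MCXT_ALLOC_NO_OOM); RecordBuf and recordbuf_allocate() are hypothetical names, and the size rounding done by the real allocate_recordbuf() is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two XLogReaderState fields involved. */
typedef struct RecordBuf
{
	char	   *data;
	uint32_t	size;
} RecordBuf;

/*
 * Replace the buffer with one of at least "newsize" bytes.  Returns false
 * on allocation failure instead of aborting.
 */
static bool
recordbuf_allocate(RecordBuf *buf, uint32_t newsize)
{
	uint32_t	actual = (newsize > 0) ? newsize : 1;

	assert(buf != NULL);

	free(buf->data);			/* free(NULL) is a no-op */
	buf->data = malloc(actual);
	if (buf->data == NULL)
	{
		/* Leave the buffer in a consistent empty state for the caller. */
		buf->size = 0;
		return false;
	}
	buf->size = actual;
	return true;
}
```

With this shape the caller decides how to report the failure, which is what lets the later patches attach a distinct error code to it.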

v5-0002-Add-infrastructure-to-report-error-codes-in-WAL-r.patch (text/x-diff; charset=us-ascii)
From 970f1374272c058be51eedf77585d98b925b25c0 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 26 Sep 2023 15:40:05 +0900
Subject: [PATCH v5 2/4] Add infrastructure to report error codes in WAL reader

This commit moves the error state coming from WAL readers into a new
structure that includes the existing pointer to the error message
buffer and gains an error code, which is fed back to the callers of
the following routines:
XLogPrefetcherReadRecord()
XLogReadRecord()
XLogNextRecord()
DecodeXLogRecord()

This will help in improving the decisions to take during recovery
depending on the failure mode reported.
---
 src/include/access/xlogprefetcher.h           |   2 +-
 src/include/access/xlogreader.h               |  33 +++-
 src/backend/access/transam/twophase.c         |   8 +-
 src/backend/access/transam/xlog.c             |   6 +-
 src/backend/access/transam/xlogprefetcher.c   |   4 +-
 src/backend/access/transam/xlogreader.c       | 170 ++++++++++++------
 src/backend/access/transam/xlogrecovery.c     |  14 +-
 src/backend/access/transam/xlogutils.c        |   2 +-
 src/backend/replication/logical/logical.c     |   9 +-
 .../replication/logical/logicalfuncs.c        |   9 +-
 src/backend/replication/slotfuncs.c           |   8 +-
 src/backend/replication/walsender.c           |   8 +-
 src/bin/pg_rewind/parsexlog.c                 |  24 +--
 src/bin/pg_waldump/pg_waldump.c               |  10 +-
 contrib/pg_walinspect/pg_walinspect.c         |  11 +-
 src/tools/pgindent/typedefs.list              |   2 +
 16 files changed, 201 insertions(+), 119 deletions(-)

diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
index 7dd7f20ad0..5563ad1a67 100644
--- a/src/include/access/xlogprefetcher.h
+++ b/src/include/access/xlogprefetcher.h
@@ -48,7 +48,7 @@ extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
 									XLogRecPtr recPtr);
 
 extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
-											char **errmsg);
+											XLogReaderError *errordata);
 
 extern void XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index da32c7db77..06664dc6fb 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -58,6 +58,24 @@ typedef struct WALSegmentContext
 
 typedef struct XLogReaderState XLogReaderState;
 
+/* Values for XLogReaderError.code */
+typedef enum XLogReaderErrorCode
+{
+	XLOG_READER_NO_ERROR = 0,
+	XLOG_READER_OOM,			/* out-of-memory */
+	XLOG_READER_INVALID_DATA,	/* record data */
+} XLogReaderErrorCode;
+
+/* Error status generated by a WAL reader on failure */
+typedef struct XLogReaderError
+{
+	/* Buffer to hold error message */
+	char	   *message;
+	/* Error code when filling *message */
+	XLogReaderErrorCode code;
+} XLogReaderError;
+
+
 /* Function type definitions for various xlogreader interactions */
 typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
@@ -307,9 +325,9 @@ struct XLogReaderState
 	char	   *readRecordBuf;
 	uint32		readRecordBufSize;
 
-	/* Buffer to hold error message */
-	char	   *errormsg_buf;
-	bool		errormsg_deferred;
+	/* Error state data */
+	XLogReaderError errordata;
+	bool		errordata_deferred;
 
 	/*
 	 * Flag to indicate to XLogPageReadCB that it should not block waiting for
@@ -324,7 +342,8 @@ struct XLogReaderState
 static inline bool
 XLogReaderHasQueuedRecordOrError(XLogReaderState *state)
 {
-	return (state->decode_queue_head != NULL) || state->errormsg_deferred;
+	return (state->decode_queue_head != NULL) ||
+		state->errordata_deferred;
 }
 
 /* Get a new XLogReader */
@@ -355,11 +374,11 @@ typedef enum XLogPageReadResult
 
 /* Read the next XLog record. Returns NULL on end-of-WAL or failure */
 extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
-										 char **errormsg);
+										 XLogReaderError *errordata);
 
 /* Consume the next record or error. */
 extern DecodedXLogRecord *XLogNextRecord(XLogReaderState *state,
-										 char **errormsg);
+										 XLogReaderError *errordata);
 
 /* Release the previously returned record, if necessary. */
 extern XLogRecPtr XLogReleasePreviousRecord(XLogReaderState *state);
@@ -399,7 +418,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
 							 DecodedXLogRecord *decoded,
 							 XLogRecord *record,
 							 XLogRecPtr lsn,
-							 char **errormsg);
+							 XLogReaderError *errordata);
 
 /*
  * Macros that provide access to parts of the record most recently returned by
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index c6af8cfd7e..08bd6586ec 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1399,7 +1399,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
 	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
 									XL_ROUTINE(.page_read = &read_local_xlog_page,
@@ -1413,15 +1413,15 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 
 	XLogBeginRead(xlogreader, lsn);
-	record = XLogReadRecord(xlogreader, &errormsg);
+	record = XLogReadRecord(xlogreader, &errordata);
 
 	if (record == NULL)
 	{
-		if (errormsg)
+		if (errordata.message)
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read two-phase state from WAL at %X/%X: %s",
-							LSN_FORMAT_ARGS(lsn), errormsg)));
+							LSN_FORMAT_ARGS(lsn), errordata.message)));
 		else
 			ereport(ERROR,
 					(errcode_for_file_access(),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcbde10529..56dd9f5b64 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -953,7 +953,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		DecodedXLogRecord *decoded;
 		StringInfoData buf;
 		StringInfoData recordBuf;
-		char	   *errormsg = NULL;
+		XLogReaderError errordata = {0};
 		MemoryContext oldCxt;
 
 		oldCxt = MemoryContextSwitchTo(walDebugCxt);
@@ -987,10 +987,10 @@ XLogInsertRecord(XLogRecData *rdata,
 								   decoded,
 								   record,
 								   EndPos,
-								   &errormsg))
+								   &errordata))
 		{
 			appendStringInfo(&buf, "error decoding record: %s",
-							 errormsg ? errormsg : "no error message");
+							 errordata.message ? errordata.message : "no error message");
 		}
 		else
 		{
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 539928cb85..92d691ca49 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -984,7 +984,7 @@ XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher, XLogRecPtr recPtr)
  * tries to initiate I/O for blocks referenced in future WAL records.
  */
 XLogRecord *
-XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
+XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, XLogReaderError *errdata)
 {
 	DecodedXLogRecord *record;
 	XLogRecPtr	replayed_up_to;
@@ -1052,7 +1052,7 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	}
 
 	/* Read the next record. */
-	record = XLogNextRecord(prefetcher->reader, errmsg);
+	record = XLogNextRecord(prefetcher->reader, errdata);
 	if (!record)
 		return NULL;
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a17263df20..fd1413b6d3 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -41,8 +41,10 @@
 #include "common/logging.h"
 #endif
 
-static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
-			pg_attribute_printf(2, 3);
+static void report_invalid_record(XLogReaderState *state,
+								  XLogReaderErrorCode errorcode,
+								  const char *fmt,...)
+			pg_attribute_printf(3, 4);
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
@@ -66,21 +68,23 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 #define DEFAULT_DECODE_BUFFER_SIZE (64 * 1024)
 
 /*
- * Construct a string in state->errormsg_buf explaining what's wrong with
+ * Construct a string in state->errordata.message explaining what's wrong with
  * the current record being read.
  */
 static void
-report_invalid_record(XLogReaderState *state, const char *fmt,...)
+report_invalid_record(XLogReaderState *state, XLogReaderErrorCode errorcode,
+					  const char *fmt,...)
 {
 	va_list		args;
 
 	fmt = _(fmt);
 
 	va_start(args, fmt);
-	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
+	vsnprintf(state->errordata.message, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
 
-	state->errormsg_deferred = true;
+	state->errordata_deferred = true;
+	state->errordata.code = errorcode;
 }
 
 /*
@@ -141,15 +145,16 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* system_identifier initialized to zeroes above */
 	state->private_data = private_data;
 	/* ReadRecPtr, EndRecPtr and readLen initialized to zeroes above */
-	state->errormsg_buf = palloc_extended(MAX_ERRORMSG_LEN + 1,
-										  MCXT_ALLOC_NO_OOM);
-	if (!state->errormsg_buf)
+	state->errordata.message = palloc_extended(MAX_ERRORMSG_LEN + 1,
+											   MCXT_ALLOC_NO_OOM);
+	if (!state->errordata.message)
 	{
 		pfree(state->readBuf);
 		pfree(state);
 		return NULL;
 	}
-	state->errormsg_buf[0] = '\0';
+	state->errordata.message[0] = '\0';
+	state->errordata.code = XLOG_READER_NO_ERROR;
 
 	/*
 	 * Allocate an initial readRecordBuf of minimal size, which can later be
@@ -157,7 +162,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	 */
 	if (!allocate_recordbuf(state, 0))
 	{
-		pfree(state->errormsg_buf);
+		pfree(state->errordata.message);
 		pfree(state->readBuf);
 		pfree(state);
 		return NULL;
@@ -175,7 +180,7 @@ XLogReaderFree(XLogReaderState *state)
 	if (state->decode_buffer && state->free_decode_buffer)
 		pfree(state->decode_buffer);
 
-	pfree(state->errormsg_buf);
+	pfree(state->errordata.message);
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
 	pfree(state->readBuf);
@@ -335,23 +340,27 @@ XLogReleasePreviousRecord(XLogReaderState *state)
  *
  * On success, a record is returned.
  *
- * The returned record (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogNextRecord.
+ * The returned record (or errordata->message) points to an internal buffer
+ * that's valid until the next call to XLogNextRecord.
  */
 DecodedXLogRecord *
-XLogNextRecord(XLogReaderState *state, char **errormsg)
+XLogNextRecord(XLogReaderState *state, XLogReaderError *errordata)
 {
 	/* Release the last record returned by XLogNextRecord(). */
 	XLogReleasePreviousRecord(state);
 
 	if (state->decode_queue_head == NULL)
 	{
-		*errormsg = NULL;
-		if (state->errormsg_deferred)
+		errordata->message = NULL;
+		errordata->code = XLOG_READER_NO_ERROR;
+		if (state->errordata_deferred)
 		{
-			if (state->errormsg_buf[0] != '\0')
-				*errormsg = state->errormsg_buf;
-			state->errormsg_deferred = false;
+			if (state->errordata.message[0] != '\0')
+				errordata->message = state->errordata.message;
+			if (state->errordata.code != XLOG_READER_NO_ERROR)
+				errordata->code = state->errordata.code;
+			state->errordata_deferred = false;
+			state->errordata.code = XLOG_READER_NO_ERROR;
 		}
 
 		/*
@@ -381,7 +390,8 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
 	state->ReadRecPtr = state->record->lsn;
 	state->EndRecPtr = state->record->next_lsn;
 
-	*errormsg = NULL;
+	errordata->message = NULL;
+	errordata->code = XLOG_READER_NO_ERROR;
 
 	return state->record;
 }
@@ -393,17 +403,17 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
  * to XLogReadRecord().
  *
  * If the page_read callback fails to read the requested data, NULL is
- * returned.  The callback is expected to have reported the error; errormsg
- * is set to NULL.
+ * returned.  The callback is expected to have reported the error;
+ * errordata->message is set to NULL.
  *
  * If the reading fails for some other reason, NULL is also returned, and
- * *errormsg is set to a string with details of the failure.
+ * *errordata is set with details of the failure.
  *
- * The returned pointer (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogReadRecord.
+ * The returned pointer (or errordata->message) points to an internal
+ * buffer that's valid until the next call to XLogReadRecord.
  */
 XLogRecord *
-XLogReadRecord(XLogReaderState *state, char **errormsg)
+XLogReadRecord(XLogReaderState *state, XLogReaderError *errordata)
 {
 	DecodedXLogRecord *decoded;
 
@@ -421,7 +431,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		XLogReadAhead(state, false /* nonblocking */ );
 
 	/* Consume the head record or error. */
-	decoded = XLogNextRecord(state, errormsg);
+	decoded = XLogNextRecord(state, errordata);
 	if (decoded)
 	{
 		/*
@@ -530,7 +540,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	bool		gotheader;
 	int			readOff;
 	DecodedXLogRecord *decoded;
-	char	   *errormsg;		/* not used */
+	XLogReaderError errordata = {0};	/* not used */
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -540,7 +550,8 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	randAccess = false;
 
 	/* reset error state */
-	state->errormsg_buf[0] = '\0';
+	state->errordata.message[0] = '\0';
+	state->errordata.code = XLOG_READER_NO_ERROR;
 	decoded = NULL;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -607,7 +618,9 @@ restart:
 	}
 	else if (targetRecOff < pageHeaderSize)
 	{
-		report_invalid_record(state, "invalid record offset at %X/%X: expected at least %u, got %u",
+		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
+							  "invalid record offset at %X/%X: expected at least %u, got %u",
 							  LSN_FORMAT_ARGS(RecPtr),
 							  pageHeaderSize, targetRecOff);
 		goto err;
@@ -616,7 +629,9 @@ restart:
 	if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
 		targetRecOff == pageHeaderSize)
 	{
-		report_invalid_record(state, "contrecord is requested by %X/%X",
+		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
+							  "contrecord is requested by %X/%X",
 							  LSN_FORMAT_ARGS(RecPtr));
 		goto err;
 	}
@@ -657,6 +672,7 @@ restart:
 		if (total_len < SizeOfXLogRecord)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "invalid record length at %X/%X: expected at least %u, got %u",
 								  LSN_FORMAT_ARGS(RecPtr),
 								  (uint32) SizeOfXLogRecord, total_len);
@@ -746,6 +762,7 @@ restart:
 			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "there is no contrecord flag at %X/%X",
 									  LSN_FORMAT_ARGS(RecPtr));
 				goto err;
@@ -759,6 +776,7 @@ restart:
 				total_len != (pageHeader->xlp_rem_len + gotlen))
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "invalid contrecord length %u (expected %lld) at %X/%X",
 									  pageHeader->xlp_rem_len,
 									  ((long long) total_len) - gotlen,
@@ -817,8 +835,10 @@ restart:
 				memcpy(save_copy, state->readRecordBuf, gotlen);
 				if (!allocate_recordbuf(state, total_len))
 				{
-					/* We treat this as a "bogus data" condition */
-					report_invalid_record(state, "record length %u at %X/%X too long",
+					/* We treat this as an out-of-memory error */
+					report_invalid_record(state,
+										  XLOG_READER_OOM,
+										  "record length %u at %X/%X too long",
 										  total_len, LSN_FORMAT_ARGS(RecPtr));
 					goto err;
 				}
@@ -881,15 +901,16 @@ restart:
 		{
 			/*
 			 * We failed to allocate memory for an oversized record.  As
-			 * above, we currently treat this as a "bogus data" condition.
+			 * above, we currently treat this as an out-of-memory error.
 			 */
 			report_invalid_record(state,
+								  XLOG_READER_OOM,
 								  "out of memory while trying to decode a record of length %u", total_len);
 			goto err;
 		}
 	}
 
-	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
+	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errordata))
 	{
 		/* Record the location of the next record. */
 		decoded->next_lsn = state->NextRecPtr;
@@ -938,7 +959,7 @@ err:
 		 * queued so that XLogPrefetcherReadRecord() doesn't bring us back a
 		 * second time and clobber the above state.
 		 */
-		state->errormsg_deferred = true;
+		state->errordata_deferred = true;
 	}
 
 	if (decoded && decoded->oversized)
@@ -951,9 +972,9 @@ err:
 	XLogReaderInvalReadState(state);
 
 	/*
-	 * If an error was written to errmsg_buf, it'll be returned to the caller
-	 * of XLogReadRecord() after all successfully decoded records from the
-	 * read queue.
+	 * If an error was written to errordata.message, it'll be returned to the
+	 * caller of XLogReadRecord() after all successfully decoded records from
+	 * the read queue.
 	 */
 
 	return XLREAD_FAIL;
@@ -972,7 +993,7 @@ XLogReadAhead(XLogReaderState *state, bool nonblocking)
 {
 	XLogPageReadResult result;
 
-	if (state->errormsg_deferred)
+	if (state->errordata_deferred)
 		return NULL;
 
 	result = XLogDecodeNextRecord(state, nonblocking);
@@ -990,8 +1011,8 @@ XLogReadAhead(XLogReaderState *state, bool nonblocking)
  * via the page_read() callback.
  *
  * Returns XLREAD_FAIL if the required page cannot be read for some
- * reason; errormsg_buf is set in that case (unless the error occurs in the
- * page_read callback).
+ * reason; errordata.message is set in that case (unless the error occurs in
+ * the page_read callback).
  *
  * Returns XLREAD_WOULDBLOCK if the requested data can't be read without
  * waiting.  This can be returned only if the installed page_read callback
@@ -1136,6 +1157,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid record length at %X/%X: expected at least %u, got %u",
 							  LSN_FORMAT_ARGS(RecPtr),
 							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
@@ -1144,6 +1166,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid resource manager ID %u at %X/%X",
 							  record->xl_rmid, LSN_FORMAT_ARGS(RecPtr));
 		return false;
@@ -1157,6 +1180,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 		if (!(record->xl_prev < RecPtr))
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "record with incorrect prev-link %X/%X at %X/%X",
 								  LSN_FORMAT_ARGS(record->xl_prev),
 								  LSN_FORMAT_ARGS(RecPtr));
@@ -1173,6 +1197,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 		if (record->xl_prev != PrevRecPtr)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "record with incorrect prev-link %X/%X at %X/%X",
 								  LSN_FORMAT_ARGS(record->xl_prev),
 								  LSN_FORMAT_ARGS(RecPtr));
@@ -1211,6 +1236,7 @@ ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
 	if (!EQ_CRC32C(record->xl_crc, crc))
 	{
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "incorrect resource manager data checksum in record at %X/%X",
 							  LSN_FORMAT_ARGS(recptr));
 		return false;
@@ -1245,6 +1271,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid magic number %04X in WAL segment %s, LSN %X/%X, offset %u",
 							  hdr->xlp_magic,
 							  fname,
@@ -1260,6 +1287,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 							  hdr->xlp_info,
 							  fname,
@@ -1276,6 +1304,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 			longhdr->xlp_sysid != state->system_identifier)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "WAL file is from different database system: WAL file database system identifier is %llu, pg_control database system identifier is %llu",
 								  (unsigned long long) longhdr->xlp_sysid,
 								  (unsigned long long) state->system_identifier);
@@ -1284,12 +1313,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		else if (longhdr->xlp_seg_size != state->segcxt.ws_segsize)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "WAL file is from different database system: incorrect segment size in page header");
 			return false;
 		}
 		else if (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ)
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "WAL file is from different database system: incorrect XLOG_BLCKSZ in page header");
 			return false;
 		}
@@ -1302,6 +1333,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 		/* hmm, first page of file doesn't have a long header? */
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 							  hdr->xlp_info,
 							  fname,
@@ -1322,6 +1354,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
+							  XLOG_READER_INVALID_DATA,
 							  "unexpected pageaddr %X/%X in WAL segment %s, LSN %X/%X, offset %u",
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
@@ -1348,6 +1381,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 			XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "out-of-sequence timeline ID %u (after %u) in WAL segment %s, LSN %X/%X, offset %u",
 								  hdr->xlp_tli,
 								  state->latestPageTLI,
@@ -1369,8 +1403,9 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 void
 XLogReaderResetError(XLogReaderState *state)
 {
-	state->errormsg_buf[0] = '\0';
-	state->errormsg_deferred = false;
+	state->errordata.message[0] = '\0';
+	state->errordata_deferred = false;
+	state->errordata.code = XLOG_READER_NO_ERROR;
 }
 
 /*
@@ -1390,7 +1425,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	XLogRecPtr	tmpRecPtr;
 	XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
 
@@ -1475,7 +1510,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	 * or we just jumped over the remaining data of a continuation.
 	 */
 	XLogBeginRead(state, tmpRecPtr);
-	while (XLogReadRecord(state, &errormsg) != NULL)
+	while (XLogReadRecord(state, &errordata) != NULL)
 	{
 		/* past the record we've found, break out */
 		if (RecPtr <= state->ReadRecPtr)
@@ -1620,8 +1655,9 @@ ResetDecoder(XLogReaderState *state)
 	state->decode_buffer_head = state->decode_buffer;
 
 	/* Clear error state. */
-	state->errormsg_buf[0] = '\0';
-	state->errormsg_deferred = false;
+	state->errordata.message[0] = '\0';
+	state->errordata_deferred = false;
+	state->errordata.code = XLOG_READER_NO_ERROR;
 }
 
 /*
@@ -1671,7 +1707,7 @@ DecodeXLogRecord(XLogReaderState *state,
 				 DecodedXLogRecord *decoded,
 				 XLogRecord *record,
 				 XLogRecPtr lsn,
-				 char **errormsg)
+				 XLogReaderError *errordata)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1754,6 +1790,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "out-of-order block_id %u at %X/%X",
 									  block_id,
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1778,6 +1815,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			if (blk->has_data && blk->data_len == 0)
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
 				goto err;
@@ -1785,6 +1823,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			if (!blk->has_data && blk->data_len != 0)
 			{
 				report_invalid_record(state,
+									  XLOG_READER_INVALID_DATA,
 									  "BKPBLOCK_HAS_DATA not set, but data length is %u at %X/%X",
 									  (unsigned int) blk->data_len,
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1821,6 +1860,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					 blk->bimg_len == BLCKSZ))
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPIMAGE_HAS_HOLE set, but hole offset %u length %u block image length %u at %X/%X",
 										  (unsigned int) blk->hole_offset,
 										  (unsigned int) blk->hole_length,
@@ -1837,6 +1877,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					(blk->hole_offset != 0 || blk->hole_length != 0))
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPIMAGE_HAS_HOLE not set, but hole offset %u length %u at %X/%X",
 										  (unsigned int) blk->hole_offset,
 										  (unsigned int) blk->hole_length,
@@ -1851,6 +1892,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					blk->bimg_len == BLCKSZ)
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPIMAGE_COMPRESSED set, but block image length %u at %X/%X",
 										  (unsigned int) blk->bimg_len,
 										  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1866,6 +1908,7 @@ DecodeXLogRecord(XLogReaderState *state,
 					blk->bimg_len != BLCKSZ)
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "neither BKPIMAGE_HAS_HOLE nor BKPIMAGE_COMPRESSED set, but block image length is %u at %X/%X",
 										  (unsigned int) blk->data_len,
 										  LSN_FORMAT_ARGS(state->ReadRecPtr));
@@ -1882,6 +1925,7 @@ DecodeXLogRecord(XLogReaderState *state,
 				if (rlocator == NULL)
 				{
 					report_invalid_record(state,
+										  XLOG_READER_INVALID_DATA,
 										  "BKPBLOCK_SAME_REL set but no previous rel at %X/%X",
 										  LSN_FORMAT_ARGS(state->ReadRecPtr));
 					goto err;
@@ -1894,6 +1938,7 @@ DecodeXLogRecord(XLogReaderState *state,
 		else
 		{
 			report_invalid_record(state,
+								  XLOG_READER_INVALID_DATA,
 								  "invalid block_id %u at %X/%X",
 								  block_id, LSN_FORMAT_ARGS(state->ReadRecPtr));
 			goto err;
@@ -1961,10 +2006,12 @@ DecodeXLogRecord(XLogReaderState *state,
 
 shortdata_err:
 	report_invalid_record(state,
+						  XLOG_READER_INVALID_DATA,
 						  "record with invalid length at %X/%X",
 						  LSN_FORMAT_ARGS(state->ReadRecPtr));
 err:
-	*errormsg = state->errormsg_buf;
+	errordata->message = state->errordata.message;
+	errordata->code = state->errordata.code;
 
 	return false;
 }
@@ -2071,6 +2118,7 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 		!record->record->blocks[block_id].in_use)
 	{
 		report_invalid_record(record,
+							  XLOG_READER_INVALID_DATA,
 							  "could not restore image at %X/%X with invalid block %d specified",
 							  LSN_FORMAT_ARGS(record->ReadRecPtr),
 							  block_id);
@@ -2078,7 +2126,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	}
 	if (!record->record->blocks[block_id].has_image)
 	{
-		report_invalid_record(record, "could not restore image at %X/%X with invalid state, block %d",
+		report_invalid_record(record,
+							  XLOG_READER_INVALID_DATA,
+							  "could not restore image at %X/%X with invalid state, block %d",
 							  LSN_FORMAT_ARGS(record->ReadRecPtr),
 							  block_id);
 		return false;
@@ -2105,7 +2155,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 									bkpb->bimg_len, BLCKSZ - bkpb->hole_length) <= 0)
 				decomp_success = false;
 #else
-			report_invalid_record(record, "could not restore image at %X/%X compressed with %s not supported by build, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not restore image at %X/%X compressed with %s not supported by build, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  "LZ4",
 								  block_id);
@@ -2122,7 +2174,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 			if (ZSTD_isError(decomp_result))
 				decomp_success = false;
 #else
-			report_invalid_record(record, "could not restore image at %X/%X compressed with %s not supported by build, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not restore image at %X/%X compressed with %s not supported by build, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  "zstd",
 								  block_id);
@@ -2131,7 +2185,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 		}
 		else
 		{
-			report_invalid_record(record, "could not restore image at %X/%X compressed with unknown method, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not restore image at %X/%X compressed with unknown method, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  block_id);
 			return false;
@@ -2139,7 +2195,9 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 
 		if (!decomp_success)
 		{
-			report_invalid_record(record, "could not decompress image at %X/%X, block %d",
+			report_invalid_record(record,
+								  XLOG_READER_INVALID_DATA,
+								  "could not decompress image at %X/%X, block %d",
 								  LSN_FORMAT_ARGS(record->ReadRecPtr),
 								  block_id);
 			return false;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..68100bfa4a 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2454,7 +2454,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
 		if (!RestoreBlockImage(record, block_id, primary_image_masked))
 			ereport(ERROR,
 					(errcode(ERRCODE_INTERNAL_ERROR),
-					 errmsg_internal("%s", record->errormsg_buf)));
+					 errmsg_internal("%s", record->errordata.message)));
 
 		/*
 		 * If masking function is defined, mask both the primary and replay
@@ -3062,9 +3062,9 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 	for (;;)
 	{
-		char	   *errormsg;
+		XLogReaderError errordata = {0};
 
-		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
+		record = XLogPrefetcherReadRecord(xlogprefetcher, &errordata);
 		if (record == NULL)
 		{
 			/*
@@ -3098,9 +3098,9 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * StandbyMode that only happens if we have been triggered, so we
 			 * shouldn't loop anymore in that case.
 			 */
-			if (errormsg)
+			if (errordata.message)
 				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+						(errmsg_internal("%s", errordata.message) /* already translated */ ));
 		}
 
 		/*
@@ -3385,9 +3385,9 @@ retry:
 		 * Emit this error right now then retry this page immediately. Use
 		 * errmsg_internal() because the message was already translated.
 		 */
-		if (xlogreader->errormsg_buf[0])
+		if (xlogreader->errordata.message[0])
 			ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-					(errmsg_internal("%s", xlogreader->errormsg_buf)));
+					(errmsg_internal("%s", xlogreader->errordata.message)));
 
 		/* reset any error XLogReaderValidatePageHeader() might have set */
 		XLogReaderResetError(xlogreader);
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 43f7b31205..a50fc9cb97 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -395,7 +395,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 		if (!RestoreBlockImage(record, block_id, page))
 			ereport(ERROR,
 					(errcode(ERRCODE_INTERNAL_ERROR),
-					 errmsg_internal("%s", record->errormsg_buf)));
+					 errmsg_internal("%s", record->errordata.message)));
 
 		/*
 		 * The page may be uninitialized. If so, we can't set the LSN because
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 41243d0187..f48feab944 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -641,12 +641,13 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 	for (;;)
 	{
 		XLogRecord *record;
-		char	   *err = NULL;
+		XLogReaderError errordata = {0};
 
 		/* the read_page callback waits for new WAL */
-		record = XLogReadRecord(ctx->reader, &err);
-		if (err)
-			elog(ERROR, "could not find logical decoding starting point: %s", err);
+		record = XLogReadRecord(ctx->reader, &errordata);
+		if (errordata.message)
+			elog(ERROR, "could not find logical decoding starting point: %s",
+				 errordata.message);
 		if (!record)
 			elog(ERROR, "could not find logical decoding starting point");
 
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 197169d6b0..ca372e5f66 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -244,11 +244,12 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		while (ctx->reader->EndRecPtr < end_of_wal)
 		{
 			XLogRecord *record;
-			char	   *errm = NULL;
+			XLogReaderError errordata = {0};
 
-			record = XLogReadRecord(ctx->reader, &errm);
-			if (errm)
-				elog(ERROR, "could not find record for logical decoding: %s", errm);
+			record = XLogReadRecord(ctx->reader, &errordata);
+			if (errordata.message)
+				elog(ERROR, "could not find record for logical decoding: %s",
+					 errordata.message);
 
 			/*
 			 * The {begin_txn,change,commit_txn}_wrapper callbacks above will
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6035cf4816..4fa4e6bfed 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -503,17 +503,17 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 		/* Decode at least one record, until we run out of records */
 		while (ctx->reader->EndRecPtr < moveto)
 		{
-			char	   *errm = NULL;
 			XLogRecord *record;
+			XLogReaderError errordata = {0};
 
 			/*
 			 * Read records.  No changes are generated in fast_forward mode,
 			 * but snapbuilder/slot statuses are updated properly.
 			 */
-			record = XLogReadRecord(ctx->reader, &errm);
-			if (errm)
+			record = XLogReadRecord(ctx->reader, &errordata);
+			if (errordata.message)
 				elog(ERROR, "could not find record while advancing replication slot: %s",
-					 errm);
+					 errordata.message);
 
 			/*
 			 * Process the record.  Storage-level changes are ignored in
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..55109bfa51 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3045,7 +3045,7 @@ static void
 XLogSendLogical(void)
 {
 	XLogRecord *record;
-	char	   *errm;
+	XLogReaderError errordata = {0};
 
 	/*
 	 * We'll use the current flush point to determine whether we've caught up.
@@ -3063,12 +3063,12 @@ XLogSendLogical(void)
 	 */
 	WalSndCaughtUp = false;
 
-	record = XLogReadRecord(logical_decoding_ctx->reader, &errm);
+	record = XLogReadRecord(logical_decoding_ctx->reader, &errordata);
 
 	/* xlog record was invalid */
-	if (errm != NULL)
+	if (errordata.message != NULL)
 		elog(ERROR, "could not find record while sending logically-decoded data: %s",
-			 errm);
+			 errordata.message);
 
 	if (record != NULL)
 	{
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 0233ece88b..e30fb311f8 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -68,7 +68,7 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 	XLogPageReadPrivate private;
 
 	private.tliIndex = tliIndex;
@@ -82,16 +82,16 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
 	XLogBeginRead(xlogreader, startpoint);
 	do
 	{
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogReadRecord(xlogreader, &errordata);
 
 		if (record == NULL)
 		{
 			XLogRecPtr	errptr = xlogreader->EndRecPtr;
 
-			if (errormsg)
+			if (errordata.message)
 				pg_fatal("could not read WAL record at %X/%X: %s",
 						 LSN_FORMAT_ARGS(errptr),
-						 errormsg);
+						 errordata.message);
 			else
 				pg_fatal("could not read WAL record at %X/%X",
 						 LSN_FORMAT_ARGS(errptr));
@@ -126,7 +126,7 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex,
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 	XLogPageReadPrivate private;
 	XLogRecPtr	endptr;
 
@@ -139,12 +139,12 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex,
 		pg_fatal("out of memory while allocating a WAL reading processor");
 
 	XLogBeginRead(xlogreader, ptr);
-	record = XLogReadRecord(xlogreader, &errormsg);
+	record = XLogReadRecord(xlogreader, &errordata);
 	if (record == NULL)
 	{
-		if (errormsg)
+		if (errordata.message)
 			pg_fatal("could not read WAL record at %X/%X: %s",
-					 LSN_FORMAT_ARGS(ptr), errormsg);
+					 LSN_FORMAT_ARGS(ptr), errordata.message);
 		else
 			pg_fatal("could not read WAL record at %X/%X",
 					 LSN_FORMAT_ARGS(ptr));
@@ -173,7 +173,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 	XLogRecord *record;
 	XLogRecPtr	searchptr;
 	XLogReaderState *xlogreader;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 	XLogPageReadPrivate private;
 
 	/*
@@ -204,14 +204,14 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 		uint8		info;
 
 		XLogBeginRead(xlogreader, searchptr);
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogReadRecord(xlogreader, &errordata);
 
 		if (record == NULL)
 		{
-			if (errormsg)
+			if (errordata.message)
 				pg_fatal("could not find previous WAL record at %X/%X: %s",
 						 LSN_FORMAT_ARGS(searchptr),
-						 errormsg);
+						 errordata.message);
 			else
 				pg_fatal("could not find previous WAL record at %X/%X",
 						 LSN_FORMAT_ARGS(searchptr));
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a3535bdfa9..880c93b51b 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -512,7 +512,7 @@ XLogRecordSaveFPWs(XLogReaderState *record, const char *savepath)
 
 		/* Full page exists, so let's save it */
 		if (!RestoreBlockImage(record, block_id, page))
-			pg_fatal("%s", record->errormsg_buf);
+			pg_fatal("%s", record->errordata.message);
 
 		(void) XLogRecGetBlockTagExtended(record, block_id,
 										  &rnode, &fork, &blk, NULL);
@@ -800,7 +800,7 @@ main(int argc, char **argv)
 	XLogRecord *record;
 	XLogRecPtr	first_record;
 	char	   *waldir = NULL;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
 	static struct option long_options[] = {
 		{"bkp-details", no_argument, NULL, 'b'},
@@ -1243,7 +1243,7 @@ main(int argc, char **argv)
 		}
 
 		/* try to read the next record */
-		record = XLogReadRecord(xlogreader_state, &errormsg);
+		record = XLogReadRecord(xlogreader_state, &errordata);
 		if (!record)
 		{
 			if (!config.follow || private.endptr_reached)
@@ -1308,10 +1308,10 @@ main(int argc, char **argv)
 	if (time_to_stop)
 		exit(0);
 
-	if (errormsg)
+	if (errordata.message)
 		pg_fatal("error in WAL record at %X/%X: %s",
 				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+				 errordata.message);
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index 796a74f322..e7d30554ed 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -146,9 +146,9 @@ static XLogRecord *
 ReadNextXLogRecord(XLogReaderState *xlogreader)
 {
 	XLogRecord *record;
-	char	   *errormsg;
+	XLogReaderError errordata = {0};
 
-	record = XLogReadRecord(xlogreader, &errormsg);
+	record = XLogReadRecord(xlogreader, &errordata);
 
 	if (record == NULL)
 	{
@@ -161,11 +161,12 @@ ReadNextXLogRecord(XLogReaderState *xlogreader)
 		if (private_data->end_of_wal)
 			return NULL;
 
-		if (errormsg)
+		if (errordata.message)
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read WAL at %X/%X: %s",
-							LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+							LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+							errordata.message)));
 		else
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -384,7 +385,7 @@ GetWALBlockInfo(FunctionCallInfo fcinfo, XLogReaderState *record,
 			if (!RestoreBlockImage(record, block_id, page))
 				ereport(ERROR,
 						(errcode(ERRCODE_INTERNAL_ERROR),
-						 errmsg_internal("%s", record->errormsg_buf)));
+						 errmsg_internal("%s", record->errordata.message)));
 
 			block_fpi_data = (bytea *) palloc(BLCKSZ + VARHDRSZ);
 			SET_VARSIZE(block_fpi_data, BLCKSZ + VARHDRSZ);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8de90c4958..9cb62b00d5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3081,6 +3081,8 @@ XLogPageReadResult
 XLogPrefetchStats
 XLogPrefetcher
 XLogPrefetcherFilter
+XLogReaderError
+XLogReaderErrorCode
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.42.0

v5-0003-Make-WAL-replay-more-robust-on-OOM-failures.patch (text/x-diff; charset=us-ascii)
From 134f203907083f557075afda695b778af627b318 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 26 Sep 2023 15:23:37 +0900
Subject: [PATCH v5 3/4] Make WAL replay more robust on OOM failures

This takes advantage of the new error facility for WAL readers, allowing
WAL replay to loop when an out-of-memory failure happens while reading a
record.  Treating such a failure as the end of WAL was the origin of
potential data loss scenarios.  Crash recovery becomes more robust by
acting the same way as a standby here: each time a record cannot be read
because of an OOM, loop and try to read the record again.
---
 src/backend/access/transam/xlogrecovery.c | 75 ++++++++++++++++-------
 1 file changed, 52 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 68100bfa4a..ed5ac06938 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3067,29 +3067,50 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errordata);
 		if (record == NULL)
 		{
-			/*
-			 * When we find that WAL ends in an incomplete record, keep track
-			 * of that record.  After recovery is done, we'll write a record
-			 * to indicate to downstream WAL readers that that portion is to
-			 * be ignored.
-			 *
-			 * However, when ArchiveRecoveryRequested = true, we're going to
-			 * switch to a new timeline at the end of recovery. We will only
-			 * copy WAL over to the new timeline up to the end of the last
-			 * complete record, so if we did this, we would later create an
-			 * overwrite contrecord in the wrong place, breaking everything.
-			 */
-			if (!ArchiveRecoveryRequested &&
-				!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
+			switch (errordata.code)
 			{
-				abortedRecPtr = xlogreader->abortedRecPtr;
-				missingContrecPtr = xlogreader->missingContrecPtr;
-			}
+				case XLOG_READER_NO_ERROR:
+					/* Possible when XLogPageRead() has failed */
+					Assert(!errordata.message);
+					/* FALLTHROUGH */
 
-			if (readFile >= 0)
-			{
-				close(readFile);
-				readFile = -1;
+				case XLOG_READER_INVALID_DATA:
+
+					/*
+					 * When we find that WAL ends in an incomplete record,
+					 * keep track of that record.  After recovery is done,
+					 * we'll write a record to indicate to downstream WAL
+					 * readers that that portion is to be ignored.
+					 *
+					 * However, when ArchiveRecoveryRequested = true, we're
+					 * going to switch to a new timeline at the end of
+					 * recovery. We will only copy WAL over to the new
+					 * timeline up to the end of the last complete record, so
+					 * if we did this, we would later create an overwrite
+					 * contrecord in the wrong place, breaking everything.
+					 */
+					if (!ArchiveRecoveryRequested &&
+						!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
+					{
+						abortedRecPtr = xlogreader->abortedRecPtr;
+						missingContrecPtr = xlogreader->missingContrecPtr;
+					}
+
+					if (readFile >= 0)
+					{
+						close(readFile);
+						readFile = -1;
+					}
+					break;
+				case XLOG_READER_OOM:
+
+					/*
+					 * If we failed because of an out-of-memory problem, just
+					 * give up and retry recovery later.  It may be possible
+					 * that the WAL record to decode required a larger memory
+					 * allocation than what the host can offer.
+					 */
+					break;
 			}
 
 			/*
@@ -3147,9 +3168,12 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * WAL from the archive, even if pg_wal is completely empty, but
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
+			 *
+			 * It may be possible that the record was not decoded because of
+			 * an out-of-memory failure.  In this case, just loop.
 			 */
 			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+				!fetching_ckpt && errordata.code != XLOG_READER_OOM)
 			{
 				ereport(DEBUG1,
 						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
@@ -3173,9 +3197,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
+			/*
+			 * In standby mode or if the WAL record failed on an
+			 * out-of-memory, loop back and retry.  Otherwise, give up.
+			 */
 			if (StandbyMode && !CheckForStandbyTrigger())
 				continue;
+			else if (errordata.code == XLOG_READER_OOM)
+				continue;
 			else
 				return NULL;
 		}
-- 
2.42.0
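
For readers skimming the diff above, the control flow that 0003 adds to
ReadRecord() can be reduced to a small standalone sketch. This is not the
PostgreSQL code: ToyReader, toy_read_record() and the READER_* codes are
hypothetical stand-ins for the xlogreader and for the XLogReaderErrorCode
values from the patch set. The only point it illustrates is that an OOM
is treated as transient and retried, while any other failure is still
taken as the end of WAL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the error codes introduced by the patch set. */
typedef enum
{
	READER_NO_ERROR = 0,
	READER_INVALID_DATA,
	READER_OOM
} ReaderErrorCode;

typedef struct
{
	ReaderErrorCode code;
	const char *message;
} ReaderError;

/*
 * Toy reader (an assumption for illustration): fails with an OOM
 * "oom_failures" times, then returns records until "records_left"
 * reaches zero, then reports invalid data.
 */
typedef struct
{
	int			oom_failures;
	int			records_left;
	int			attempts;
} ToyReader;

static const char *
toy_read_record(ToyReader *reader, ReaderError *err)
{
	reader->attempts++;
	err->code = READER_NO_ERROR;
	err->message = NULL;

	if (reader->oom_failures > 0)
	{
		reader->oom_failures--;
		err->code = READER_OOM;
		err->message = "out of memory while trying to decode a record";
		return NULL;
	}
	if (reader->records_left == 0)
	{
		err->code = READER_INVALID_DATA;
		err->message = "invalid record length";
		return NULL;
	}
	reader->records_left--;
	return "record";
}

/* Return the next record, or NULL once the end of WAL is reached. */
static const char *
read_record_retrying_on_oom(ToyReader *reader, ReaderError *err)
{
	assert(reader != NULL && err != NULL);

	for (;;)
	{
		const char *record = toy_read_record(reader, err);

		if (record != NULL)
			return record;

		switch (err->code)
		{
			case READER_OOM:
				/* Transient failure: loop and read the same record again. */
				continue;
			case READER_NO_ERROR:
			case READER_INVALID_DATA:
				/* Treated as the end of WAL, as before the patch. */
				return NULL;
		}
	}
}
```

With a reader set up to fail three times on OOM and then hold two
records, the first call returns a record on the fourth attempt, and the
third call returns NULL with READER_INVALID_DATA. A loop like this spins
for as long as the allocation keeps failing, so a real implementation
would presumably want a delay or an upper bound on retries; that remark
is an assumption of this sketch and is not something the patch does.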

v5-0004-Tweak-to-force-OOM-behavior-when-replaying-record.patch (text/x-diff; charset=us-ascii)
From 1b2c50fd98a062d0b4617f8d618cdca4d6428e5a Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 26 Sep 2023 15:23:50 +0900
Subject: [PATCH v5 4/4] Tweak to force OOM behavior when replaying records

---
 src/backend/access/transam/xlogreader.c | 31 +++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fd1413b6d3..854f584e30 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -541,6 +541,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	int			readOff;
 	DecodedXLogRecord *decoded;
 	XLogReaderError errordata = {0};	/* not used */
+	bool        trigger_oom = false;
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -690,7 +691,29 @@ restart:
 	decoded = XLogReadRecordAlloc(state,
 								  total_len,
 								  false /* allow_oversized */ );
-	if (decoded == NULL && nonblocking)
+
+#ifndef FRONTEND
+	/*
+	 * Trick to emulate an OOM after a hardcoded number of records
+	 * replayed.
+	 */
+	{
+		struct stat fstat;
+		static int counter = 0;
+		if (stat("/tmp/xlogreader_oom", &fstat) == 0)
+		{
+			counter++;
+			if (counter >= 100)
+			{
+				trigger_oom = true;
+				/* Reset counter, to not fail when shutting down WAL */
+				counter = 0;
+			}
+		}
+	}
+#endif
+
+	if ((decoded == NULL || trigger_oom) && nonblocking)
 	{
 		/*
 		 * There is no space in the circular decode buffer, and the caller is
@@ -833,7 +856,7 @@ restart:
 				Assert(gotlen <= lengthof(save_copy));
 				Assert(gotlen <= state->readRecordBufSize);
 				memcpy(save_copy, state->readRecordBuf, gotlen);
-				if (!allocate_recordbuf(state, total_len))
+				if (!allocate_recordbuf(state, total_len) || trigger_oom)
 				{
 					/* We treat this as an out-of-memory error */
 					report_invalid_record(state,
@@ -891,13 +914,13 @@ restart:
 	 * If we got here without a DecodedXLogRecord, it means we needed to
 	 * validate total_len before trusting it, but by now now we've done that.
 	 */
-	if (decoded == NULL)
+	if (decoded == NULL || trigger_oom)
 	{
 		Assert(!nonblocking);
 		decoded = XLogReadRecordAlloc(state,
 									  total_len,
 									  true /* allow_oversized */ );
-		if (decoded == NULL)
+		if (decoded == NULL || trigger_oom)
 		{
 			/*
 			 * We failed to allocate memory for an oversized record.  As
-- 
2.42.0

#24 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#23)
Re: Incorrect handling of OOM in WAL replay leading to data loss

On Tue, Oct 03, 2023 at 04:20:45PM +0900, Michael Paquier wrote:

> If there's no interest in this patch set after the next CF, I'm OK to
> drop it. The state of HEAD is at least correct in the OOM cases now.

I have been thinking about this patch for the last few days, and in
light of 6b18b3fe2c2f I am going to withdraw it for now as this makes
the error layers of the xlogreader more complicated.
--
Michael