Logical replication error "no record found" /* shouldn't happen */

Started by Andrey Borodinover 4 years ago3 messages
#1Andrey Borodin
x4mmm@yandex-team.ru
1 attachment(s)

Hi!

From time to time I observe $subj on clusters using logical replication.
I most of cases there are a lot of other errors. Probably $subj condition should be kind of impossible without other problems.
I propose to enhance error logging of XLogReadRecord() in ReadPageInternal().

Thank you!

Best regards, Andrey Borodin.

Attachments:

0001-Improve-error-reporting-of-ReadPageInternal.patchapplication/octet-stream; name=0001-Improve-error-reporting-of-ReadPageInternal.patch; x-unix-mode=0644Download
From cac77be465755ff587d692c60fce6c6959d3f015 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 23 Jul 2021 14:04:56 +0500
Subject: [PATCH] Improve error reporting of ReadPageInternal()

Currently from time to time error "no record found" happens.
This is shouldn't happen error, yet it is observed. This commit
improves errormsg reported in this case.
---
 src/backend/access/transam/xlogreader.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 42738eb940..926255f6ef 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -609,14 +609,22 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 										   state->currRecPtr,
 										   state->readBuf);
 		if (readLen < 0)
+		{
+			report_invalid_record(state, "Attempt to read page of next segment failed at %X/%X",
+							  (uint32) (pageptr >> 32), (uint32) pageptr);
 			goto err;
+		}
 
 		/* we can be sure to have enough WAL available, we scrolled back */
 		Assert(readLen == XLOG_BLCKSZ);
 
 		if (!XLogReaderValidatePageHeader(state, targetSegmentPtr,
 										  state->readBuf))
+		{
+			report_invalid_record(state, "Invalid page header at %X/%X",
+							  (uint32) (pageptr >> 32), (uint32) pageptr);
 			goto err;
+		}
 	}
 
 	/*
@@ -627,13 +635,21 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 									   state->currRecPtr,
 									   state->readBuf);
 	if (readLen < 0)
+	{
+		report_invalid_record(state, "Attempt to read page failed at %X/%X",
+							(uint32) (pageptr >> 32), (uint32) pageptr);
 		goto err;
+	}
 
 	Assert(readLen <= XLOG_BLCKSZ);
 
 	/* Do we have enough data to check the header length? */
 	if (readLen <= SizeOfXLogShortPHD)
+	{
+		report_invalid_record(state, "Unable to read short header of %d bytes at %X/%X", readLen,
+							(uint32) (pageptr >> 32), (uint32) pageptr);
 		goto err;
+	}
 
 	Assert(readLen >= reqLen);
 
@@ -646,14 +662,22 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 										   state->currRecPtr,
 										   state->readBuf);
 		if (readLen < 0)
+		{
+			report_invalid_record(state, "Unable to read long header of %d bytes at %X/%X", readLen,
+								(uint32) (pageptr >> 32), (uint32) pageptr);
 			goto err;
+		}
 	}
 
 	/*
 	 * Now that we know we have the full header, validate it.
 	 */
 	if (!XLogReaderValidatePageHeader(state, pageptr, (char *) hdr))
+	{
+		report_invalid_record(state, "Invalid full page header at %X/%X",
+							(uint32) (pageptr >> 32), (uint32) pageptr);
 		goto err;
+	}
 
 	/* update read state information */
 	state->seg.ws_segno = targetSegNo;
-- 
2.24.3 (Apple Git-128)

#2Michail Nikolaev
michail.nikolaev@gmail.com
In reply to: Andrey Borodin (#1)
Re: Logical replication error "no record found" /* shouldn't happen */

Hello.

I saw this error multiple times trying to replicate the 2-3 TB server
(version 11 to version 12). I was unable to find any explanation for
this error.

Thanks,
Michail.

#3Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Andrey Borodin (#1)
1 attachment(s)
Re: Logical replication error "no record found" /* shouldn't happen */

On 2021-Jul-23, Andrey Borodin wrote:

Hi!

From time to time I observe $subj on clusters using logical replication.
I most of cases there are a lot of other errors. Probably $subj condition should be kind of impossible without other problems.
I propose to enhance error logging of XLogReadRecord() in ReadPageInternal().

Hmm.

A small problem in this patch is that XLogReaderValidatePageHeader
already sets errormsg_buf; you're overwriting that. I suggest to leave
that untouched. There are other two cases where the problem occurs in
page_read() callback; ReadPageInternal explicitly documents that it
doesn't set the error in that case. We have two options to deal with
that:

1. change all existing callbacks to set the errormsg_buf depending on
what actually fails, and then if they return failure without an error
message, add something like your proposed message.
2. throw error directly in the callback rather than returning. I don't
think this strategy actually works

I attach a cut-down patch that doesn't deal with the page_read callbacks
issue, just added stub comments in xlog.c where something should be
done.

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"I am amazed at [the pgsql-sql] mailing list for the wonderful support, and
lack of hesitasion in answering a lost soul's question, I just wished the rest
of the mailing list could be like this." (Fotis)
(http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)

Attachments:

errormsg.patchtext/x-diff; charset=utf-8Download
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d894af310a..83976cb014 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12467,6 +12467,7 @@ retry:
 										 private->replayTLI,
 										 xlogreader->EndRecPtr))
 		{
+			/* XXX should this path set errormsg_buf? */
 			if (readFile >= 0)
 				close(readFile);
 			readFile = -1;
@@ -12598,7 +12599,10 @@ next_record_is_invalid:
 	if (StandbyMode)
 		goto retry;
 	else
+	{
+		/* XXX should set errormsg_buf here */
 		return -1;
+	}
 }
 
 /*
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3a7de02565..5b61593820 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -650,14 +650,22 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 										   state->currRecPtr,
 										   state->readBuf);
 		if (readLen < 0)
+		{
+			report_invalid_record(state,
+								  "attempt to read page of next segment failed at %X/%X",
+								  LSN_FORMAT_ARGS(targetSegmentPtr));
 			goto err;
+		}
 
 		/* we can be sure to have enough WAL available, we scrolled back */
 		Assert(readLen == XLOG_BLCKSZ);
 
 		if (!XLogReaderValidatePageHeader(state, targetSegmentPtr,
 										  state->readBuf))
+		{
+			/* XLogReaderValidatePageHeader sets errormsg_buf */
 			goto err;
+		}
 	}
 
 	/*
@@ -668,13 +676,18 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 									   state->currRecPtr,
 									   state->readBuf);
 	if (readLen < 0)
-		goto err;
+		goto err;	/* XXX errmsg? */
 
 	Assert(readLen <= XLOG_BLCKSZ);
 
 	/* Do we have enough data to check the header length? */
 	if (readLen <= SizeOfXLogShortPHD)
+	{
+		report_invalid_record(state,
+							  "unable to read short header of %d bytes at %X/%X",
+							  readLen, LSN_FORMAT_ARGS(pageptr));
 		goto err;
+	}
 
 	Assert(readLen >= reqLen);
 
@@ -687,14 +700,17 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 										   state->currRecPtr,
 										   state->readBuf);
 		if (readLen < 0)
-			goto err;
+			goto err;	/* XXX errmsg */
 	}
 
 	/*
 	 * Now that we know we have the full header, validate it.
 	 */
 	if (!XLogReaderValidatePageHeader(state, pageptr, (char *) hdr))
+	{
+		/* XLogReaderValidatePageHeader sets errormsg_buf */
 		goto err;
+	}
 
 	/* update read state information */
 	state->seg.ws_segno = targetSegNo;