standby promotion can create unreadable WAL

Started by Robert Haasover 3 years ago18 messages
#1Robert Haas
robertmhaas@gmail.com
1 attachment(s)

My colleague Dilip Kumar and I have discovered what I believe to be a
bug in the recently-added "overwrite contrecord" stuff. I'm not sure
whether or not this bug has any serious consequences. I think that
there may be a scenario where it does, but I'm not sure about that.

Suppose you have a primary and a standby, and the standby is promoted
after reading a partial WAL record. The attached script, which was
written by Dilip and slightly modified by me, creates this scenario by
setting up an archiving-only standby, writing a record that crosses a
segment boundary, and then promoting the standby. If you then try to
run pg_waldump on the WAL on timeline 2, it goes boom:

[rhaas pg_wal]$ pg_waldump 000000020000000000000004
000000020000000000000005 2>&1 | tail -n4
rmgr: Heap len (rec/tot): 1959/ 1959, tx: 728, lsn:
0/04FFE7B0, prev 0/04FFDFF0, desc: INSERT off 4 flags 0x00, blkref #0:
rel 1663/5/16384 blk 2132
rmgr: Heap len (rec/tot): 1959/ 1959, tx: 728, lsn:
0/04FFEF58, prev 0/04FFE7B0, desc: INSERT+INIT off 1 flags 0x00,
blkref #0: rel 1663/5/16384 blk 2133
rmgr: Heap len (rec/tot): 1959/ 1959, tx: 728, lsn:
0/04FFF700, prev 0/04FFEF58, desc: INSERT off 2 flags 0x00, blkref #0:
rel 1663/5/16384 blk 2133
pg_waldump: error: error in WAL record at 0/4FFF700: invalid record
length at 0/4FFFEA8: wanted 24, got 0

What's happening here is that the last WAL segment from timeline 1,
which is 000000010000000000000004, gets copied over to the new
timeline up to the point where the last complete record on that
timeline ends, namely, 0/4FFFEA8. I think that the first record on the
new timeline should be written starting at that LSN, but that's not
what happens. Instead, the rest of that WAL segment remains zeroed,
and the first WAL record on the new timeline is written at the
beginning of the next segment:

[rhaas pg_wal]$ pg_waldump 000000020000000000000005 2>&1 | head -n4
rmgr: XLOG len (rec/tot): 42/ 42, tx: 0, lsn:
0/05000028, prev 0/04FFF700, desc: OVERWRITE_CONTRECORD lsn 0/4FFFEA8;
time 2022-08-22 13:49:22.874435 EDT
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn:
0/05000058, prev 0/05000028, desc: CHECKPOINT_SHUTDOWN redo 0/5000058;
tli 2; prev tli 1; fpw true; xid 0:729; oid 24576; multi 1; offset 0;
oldest xid 719 in DB 1; oldest multi 1 in DB 1; oldest/newest commit
timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: XLOG len (rec/tot): 30/ 30, tx: 0, lsn:
0/050000D0, prev 0/05000058, desc: NEXTOID 32768
rmgr: Storage len (rec/tot): 42/ 42, tx: 0, lsn:
0/050000F0, prev 0/050000D0, desc: CREATE base/5/24576

Nothing that uses xlogreader is going to be able to bridge the gap
between file #4 and file #5. In this case it doesn't matter very much,
because we immediately write a checkpoint record into file #5, so if
we crash we won't try to replay file #4 anyway. However, if anything
did try to look at file #4 it would get confused. Maybe that can
happen if this is a streaming standby, where we only write an
end-of-recovery record upon promotion, rather than a checkpoint, or
maybe if there are cascading standbys someone could try to actually
use the 000000020000000000000004 file for something. I'm not sure. But
unless I'm missing something, that file is bogus, and our only hope of
not having problems is that perhaps no one will ever look at it.

I think that the cause of this problem is this code right here:

/*
* Actually, if WAL ended in an incomplete record, skip the parts that
* made it through and start writing after the portion that persisted.
* (It's critical to first write an OVERWRITE_CONTRECORD message, which
* we'll do as soon as we're open for writing new WAL.)
*/
if (!XLogRecPtrIsInvalid(missingContrecPtr))
{
Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
EndOfLog = missingContrecPtr;
}

It seems to me that this if-statement should also test that the TLI
has not changed i.e. if (newTLI != endOfRecoveryInfo->lastRecTLI &&
!XLogRecPtrIsInvalid(missingContrecPtr)). If the TLI hasn't changed,
then everything the comment says is correct and I think that what the
code does is also correct. However, if the TLI *has* changed, then I
think we must not advance EndOfLog here, because the WAL that was
copied from the old timeline to the new timeline ends at the point in
the file corresponding to the value of EndOfLog just before executing
this code. When this code then moves EndOfLog forward to the beginning
of the next segment, it leaves the unused portion of the previous
segment as all zeroes, which creates the problem described above.

(Incidentally, there's also a bug in pg_waldump here: it's reporting
the wrong LSN as the source of the error. 0/4FFF700 is not the record
that's busted, as shown by the fact that it was successfully decoded
and shown in the output. The relevant code in pg_waldump should be
using EndRecPtr instead of ReadRecPtr to report the error. If it did,
it would complain about 0/4FFFEA8, which is where the problem really
is. This is of the same vintage as the bug fixed by
d9fbb8862959912c5266364059c0abeda0c93bbf, though in that case the
issue was reporting all errors using the start LSN of the first of
several records read no matter where the error actually happened,
whereas in this case the error is using the start LSN of the previous
record instead of the current one.)

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

test_contrecord.shtext/x-sh; charset=US-ASCII; name=test_contrecord.shDownload
#2Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#1)
Re: standby promotion can create unreadable WAL

On Mon, Aug 22, 2022 at 02:36:36PM -0400, Robert Haas wrote:

(Incidentally, there's also a bug in pg_waldump here: it's reporting
the wrong LSN as the source of the error. 0/4FFF700 is not the record
that's busted, as shown by the fact that it was successfully decoded
and shown in the output. The relevant code in pg_waldump should be
using EndRecPtr instead of ReadRecPtr to report the error. If it did,
it would complain about 0/4FFFEA8, which is where the problem really
is. This is of the same vintage as the bug fixed by
d9fbb8862959912c5266364059c0abeda0c93bbf, though in that case the
issue was reporting all errors using the start LSN of the first of
several records read no matter where the error actually happened,
whereas in this case the error is using the start LSN of the previous
record instead of the current one.)

There was some previous discussion on this [0]/messages/by-id/2B4510B2-3D70-4990-BFE3-0FE64041C08A@amazon.com [1]/messages/by-id/20220127.100738.1985658263632578184.horikyota.ntt@gmail.com.

[0]: /messages/by-id/2B4510B2-3D70-4990-BFE3-0FE64041C08A@amazon.com
[1]: /messages/by-id/20220127.100738.1985658263632578184.horikyota.ntt@gmail.com

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#3Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#2)
Re: standby promotion can create unreadable WAL

On Mon, Aug 22, 2022 at 10:38 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:

There was some previous discussion on this [0] [1].

[0] /messages/by-id/2B4510B2-3D70-4990-BFE3-0FE64041C08A@amazon.com
[1] /messages/by-id/20220127.100738.1985658263632578184.horikyota.ntt@gmail.com

Thanks. It seems like there are various doubts on those threads about
whether EndRecPtr is really the right thing, but I'm not able to
understand what the problem is exactly. Certainly, in the common case,
it is, and as far as I can tell from looking at the code, it's what
we're intended to use. So I would like to see some concrete evidence
of it being wrong before we conclude that we need to do anything other
than a trivial change.

But the main issue for this thread is that we seem to be generating
invalid WAL. That seems like something we'd better get fixed.

--
Robert Haas
EDB: http://www.enterprisedb.com

#4Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#1)
Re: standby promotion can create unreadable WAL

On Tue, Aug 23, 2022 at 12:06 AM Robert Haas <robertmhaas@gmail.com> wrote:

Nothing that uses xlogreader is going to be able to bridge the gap
between file #4 and file #5. In this case it doesn't matter very much,
because we immediately write a checkpoint record into file #5, so if
we crash we won't try to replay file #4 anyway. However, if anything
did try to look at file #4 it would get confused. Maybe that can
happen if this is a streaming standby, where we only write an
end-of-recovery record upon promotion, rather than a checkpoint, or
maybe if there are cascading standbys someone could try to actually
use the 000000020000000000000004 file for something. I'm not sure. But
unless I'm missing something, that file is bogus, and our only hope of
not having problems is that perhaps no one will ever look at it.

Yeah, this analysis looks correct to me.

I think that the cause of this problem is this code right here:

/*
* Actually, if WAL ended in an incomplete record, skip the parts that
* made it through and start writing after the portion that persisted.
* (It's critical to first write an OVERWRITE_CONTRECORD message, which
* we'll do as soon as we're open for writing new WAL.)
*/
if (!XLogRecPtrIsInvalid(missingContrecPtr))
{
Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
EndOfLog = missingContrecPtr;
}

Yeah, this statement as well as another statement that creates the
overwrite contrecord. After changing these two lines the problem is
fixed for me. Although I haven't yet thought of all the scenarios
that whether it is safe in all the cases. I agree that after timeline
changes we are pointing to the end of the last valid record we can
start writing the next record from that point onward. But I think we
should need to think hard that whether it will break any case for
which the overwrite contrecord was actually introduced.

diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index 7602fc8..3d38613 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5491,7 +5491,7 @@ StartupXLOG(void)
         * (It's critical to first write an OVERWRITE_CONTRECORD message, which
         * we'll do as soon as we're open for writing new WAL.)
         */
-       if (!XLogRecPtrIsInvalid(missingContrecPtr))
+       if (newTLI == endOfRecoveryInfo->lastRecTLI &&
!XLogRecPtrIsInvalid(missingContrecPtr))
        {
                Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
                EndOfLog = missingContrecPtr;
@@ -5589,7 +5589,7 @@ StartupXLOG(void)
        LocalSetXLogInsertAllowed();
        /* If necessary, write overwrite-contrecord before doing
anything else */
-       if (!XLogRecPtrIsInvalid(abortedRecPtr))
+       if (newTLI == endOfRecoveryInfo->lastRecTLI &&
!XLogRecPtrIsInvalid(abortedRecPtr))
        {
                Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
                CreateOverwriteContrecordRecord(abortedRecPtr,
missingContrecPtr, newTLI);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#5Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Dilip Kumar (#4)
Re: standby promotion can create unreadable WAL

Nice find!

At Wed, 24 Aug 2022 11:09:44 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in

On Tue, Aug 23, 2022 at 12:06 AM Robert Haas <robertmhaas@gmail.com> wrote:

Nothing that uses xlogreader is going to be able to bridge the gap
between file #4 and file #5. In this case it doesn't matter very much,
because we immediately write a checkpoint record into file #5, so if
we crash we won't try to replay file #4 anyway. However, if anything
did try to look at file #4 it would get confused. Maybe that can
happen if this is a streaming standby, where we only write an
end-of-recovery record upon promotion, rather than a checkpoint, or
maybe if there are cascading standbys someone could try to actually
use the 000000020000000000000004 file for something. I'm not sure. But
unless I'm missing something, that file is bogus, and our only hope of
not having problems is that perhaps no one will ever look at it.

Yeah, this analysis looks correct to me.

(I didn't reproduce the case but understand what is happening.)

Me, too. There are two ways to deal with this, I think. One is start
writing new records from abortedContRecPtr as if it were not
exist. Another is copying WAL file up to missingContRecPtr. Since the
first segment of the new timeline doesn't need to be identcal to the
last one of the previous timeline, so I think the former way is
cleaner. XLogInitNewTimeline or near seems to be be the place for fix
to me. Clearing abortedRecPtr and missingContrecPtr just before the
call to findNewestTimeLine will work?

====
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 87b243e0d4..27e01153e7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5396,6 +5396,13 @@ StartupXLOG(void)
 		 */
 		XLogInitNewTimeline(EndOfLogTLI, EndOfLog, newTLI);
+		/*
+		 * EndOfLog doesn't cover aborted contrecord even if the last record
+		 * was that, then the next timeline starts writing from there. Forget
+		 * about aborted and missing contrecords even if any.
+		 */
+		abortedRecPtr = missingContrecPtr = InvalidXLogRecPtr;
+
 		/*
 		 * Remove the signal files out of the way, so that we don't
 		 * accidentally re-enter archive recovery mode in a subsequent crash.
====

I think that the cause of this problem is this code right here:

/*
* Actually, if WAL ended in an incomplete record, skip the parts that
* made it through and start writing after the portion that persisted.
* (It's critical to first write an OVERWRITE_CONTRECORD message, which
* we'll do as soon as we're open for writing new WAL.)
*/
if (!XLogRecPtrIsInvalid(missingContrecPtr))
{
Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
EndOfLog = missingContrecPtr;
}

Yeah, this statement as well as another statement that creates the
overwrite contrecord. After changing these two lines the problem is
fixed for me. Although I haven't yet thought of all the scenarios
that whether it is safe in all the cases. I agree that after timeline
changes we are pointing to the end of the last valid record we can
start writing the next record from that point onward. But I think we
should need to think hard that whether it will break any case for
which the overwrite contrecord was actually introduced.

diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index 7602fc8..3d38613 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5491,7 +5491,7 @@ StartupXLOG(void)
* (It's critical to first write an OVERWRITE_CONTRECORD message, which
* we'll do as soon as we're open for writing new WAL.)
*/
-       if (!XLogRecPtrIsInvalid(missingContrecPtr))
+       if (newTLI == endOfRecoveryInfo->lastRecTLI &&
!XLogRecPtrIsInvalid(missingContrecPtr))
{
Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
EndOfLog = missingContrecPtr;
@@ -5589,7 +5589,7 @@ StartupXLOG(void)
LocalSetXLogInsertAllowed();
/* If necessary, write overwrite-contrecord before doing
anything else */
-       if (!XLogRecPtrIsInvalid(abortedRecPtr))
+       if (newTLI == endOfRecoveryInfo->lastRecTLI &&
!XLogRecPtrIsInvalid(abortedRecPtr))
{
Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
CreateOverwriteContrecordRecord(abortedRecPtr,
missingContrecPtr, newTLI);

This also seems to work, of course.

# However, I haven't managed to reproduce that, yet...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#6Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#5)
Re: standby promotion can create unreadable WAL

On Wed, Aug 24, 2022 at 4:40 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Me, too. There are two ways to deal with this, I think. One is start
writing new records from abortedContRecPtr as if it were not
exist. Another is copying WAL file up to missingContRecPtr. Since the
first segment of the new timeline doesn't need to be identcal to the
last one of the previous timeline, so I think the former way is
cleaner.

I agree, mostly because that gets us back to the way all of this
worked before the contrecord stuff went in. This case wasn't broken
then, because the breakage had to do with it being unsafe to back up
and rewrite WAL that might have already been shipped someplace, and
that's not an issue when we're first creating a totally new timeline.
It seems safer to me to go back to the way this worked before the fix
went in than to change over to a new system.

Honestly, in a vacuum, I might prefer to get rid of this thing where
the WAL segment gets copied over from the old timeline to the new, and
just always switch TLIs at segment boundaries. And while we're at it,
I'd also like TLIs to be 64-bit random numbers instead of integers
assigned in ascending order. But those kinds of design changes seem
best left for a future master-only development effort. Here, we need
to back-patch the fix, and should try to just unbreak what's currently
broken.

XLogInitNewTimeline or near seems to be be the place for fix
to me. Clearing abortedRecPtr and missingContrecPtr just before the
call to findNewestTimeLine will work?

Hmm, yeah, that seems like a good approach.

--
Robert Haas
EDB: http://www.enterprisedb.com

#7Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#1)
Re: standby promotion can create unreadable WAL

On Tue, Aug 23, 2022 at 12:06 AM Robert Haas <robertmhaas@gmail.com> wrote:

However, if anything

did try to look at file #4 it would get confused. Maybe that can
happen if this is a streaming standby, where we only write an
end-of-recovery record upon promotion, rather than a checkpoint, or
maybe if there are cascading standbys someone could try to actually
use the 000000020000000000000004 file for something. I'm not sure. But
unless I'm missing something, that file is bogus, and our only hope of
not having problems is that perhaps no one will ever look at it.

I tried in streaming mode, but it seems in the streaming mode we will
never create this bogus file because of this check [1]ReadRecord{ ..record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg); if (record == NULL) { /* * When not in standby mode we find that WAL ends in an incomplete * record, keep track of that record. After recovery is done, * we’ll write a record to indicate to downstream WAL readers that * that portion is to be ignored. */ if (!StandbyMode && !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr)) { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; }. So if the
StandbyMode is true then we are never setting "abortedRecPtr" and
"missingContrecPtr" which means we will never create that 0-filled gap
in the WAL file that we are discussing in this thread.

Do we need to set it? I feel we don't. Why? because on this thread
we are also discussing that if the timeline switches then we don’t
need to create that 0-filled gap and that is the actual problem we are
discussing here. And we know that if we are coming out of the
StandbyMode then we will always switch the timeline so we don’t create
that 0-filled gap. OTOH if we are coming out of the archive recovery
then also we will switch the timeline so in that case also we do not
need that. So practically we need to 0 fill that partial record only
when we are doing the crash recovery is that understanding correct?
If so then we can simply avoid setting these variables if
ArchiveRecoveryRequested is true. So in the below check[1]ReadRecord{ ..record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg); if (record == NULL) { /* * When not in standby mode we find that WAL ends in an incomplete * record, keep track of that record. After recovery is done, * we’ll write a record to indicate to downstream WAL readers that * that portion is to be ignored. */ if (!StandbyMode && !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr)) { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; } instead of
(!StandbyMode), we can just put (! ArchiveRecoveryRequested), and then
we don't need any other fix. Am I missing anything?

[1]: ReadRecord{ ..record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg); if (record == NULL) { /* * When not in standby mode we find that WAL ends in an incomplete * record, keep track of that record. After recovery is done, * we’ll write a record to indicate to downstream WAL readers that * that portion is to be ignored. */ if (!StandbyMode && !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr)) { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; }
ReadRecord{
..record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
if (record == NULL)
{
/*
* When not in standby mode we find that WAL ends in an incomplete
* record, keep track of that record. After recovery is done,
* we’ll write a record to indicate to downstream WAL readers that
* that portion is to be ignored.
*/
if (!StandbyMode &&
!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
{
abortedRecPtr = xlogreader->abortedRecPtr;
missingContrecPtr = xlogreader->missingContrecPtr;
}

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#8Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#7)
1 attachment(s)
Re: standby promotion can create unreadable WAL

On Fri, Aug 26, 2022 at 8:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

ArchiveRecoveryRequested is true. So in the below check[1] instead of
(!StandbyMode), we can just put (! ArchiveRecoveryRequested), and then
we don't need any other fix. Am I missing anything?

[1]
ReadRecord{
..record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
if (record == NULL)
{
/*
* When not in standby mode we find that WAL ends in an incomplete
* record, keep track of that record. After recovery is done,
* we’ll write a record to indicate to downstream WAL readers that
* that portion is to be ignored.
*/
if (!StandbyMode &&
!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
{
abortedRecPtr = xlogreader->abortedRecPtr;
missingContrecPtr = xlogreader->missingContrecPtr;
}

I agree. Testing StandbyMode here seems bogus. I thought initially
that the test should perhaps be for InArchiveRecovery rather than
ArchiveRecoveryRequested, but I see that the code which switches to a
new timeline cares about ArchiveRecoveryRequested, so I think that is
the correct thing to test here as well.

Concretely, I propose the following patch.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

fix-contrecord-condition-v1.patchapplication/octet-stream; name=fix-contrecord-condition-v1.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 537845cada..fbf2d34eef 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5433,6 +5433,14 @@ StartupXLOG(void)
 	 */
 	if (!XLogRecPtrIsInvalid(missingContrecPtr))
 	{
+		/*
+		 * We should only have a missingContrecPtr if we're not switching to
+		 * a new timeline. When a timeline switch occurs, WAL is copied from
+		 * the old timeline to the new only up to the end of the last complete
+		 * record, so there can't be an incomplete WAL record that we need to
+		 * disregard.
+		 */
+		Assert(newTLI == endOfRecoveryInfo->lastRecTLI);
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
 	}
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a59a0e826b..f233fa3494 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3024,12 +3024,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		if (record == NULL)
 		{
 			/*
-			 * When not in standby mode we find that WAL ends in an incomplete
-			 * record, keep track of that record.  After recovery is done,
-			 * we'll write a record to indicate to downstream WAL readers that
-			 * that portion is to be ignored.
+			 * When find that WAL ends in an incomplete record, keep track of
+			 * that record.  After recovery is done, we'll write a record to
+			 * indicate to downstream WAL readers that that portion is to be
+			 * ignored.
+			 *
+			 * However, when ArchiveRecoveryRequested = true, we're going to
+			 * switch to a new timeline at the end of recovery. We will only
+			 * copy WAL over to the new timeline up to the end of the last
+			 * complete record, so the new timeline will never end in an
+			 * incomplete record, and thus we should not do this.
 			 */
-			if (!StandbyMode &&
+			if (!ArchiveRecoveryRequested &&
 				!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
#9Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Robert Haas (#8)
Re: standby promotion can create unreadable WAL

On 2022-Aug-26, Robert Haas wrote:

I agree. Testing StandbyMode here seems bogus. I thought initially
that the test should perhaps be for InArchiveRecovery rather than
ArchiveRecoveryRequested, but I see that the code which switches to a
new timeline cares about ArchiveRecoveryRequested, so I think that is
the correct thing to test here as well.

Yeah, I think you had already established elsewhere that testing
StandbyMode was the wrong thing to do. Testing ArchiveRecoveryRequested
here seems quite odd at first, but given the copying behavior, I agree
that it seems a correct thing to do.

There's a small typo in the comment: "When find that". I suppose that
was meant to be "When we find that". You end that para with "and thus
we should not do this", but that sounds like it wouldn't matter if we
did. Maybe "and thus doing this would be wrong, so skip it." or
something like that. (Perhaps be even more specific and say "if we did
this, we would later create an overwrite record in the wrong place,
breaking everything")

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#10Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#9)
1 attachment(s)
Re: standby promotion can create unreadable WAL

On Fri, Aug 26, 2022 at 10:06 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

There's a small typo in the comment: "When find that". I suppose that
was meant to be "When we find that". You end that para with "and thus
we should not do this", but that sounds like it wouldn't matter if we
did. Maybe "and thus doing this would be wrong, so skip it." or
something like that. (Perhaps be even more specific and say "if we did
this, we would later create an overwrite record in the wrong place,
breaking everything")

I think that saying that someone should not do something implies
pretty clearly that it would be bad if they did. But I have no problem
with your more specific language, and as a general rule, it's good to
be specific, so let's use that.

v2 attached.

Thanks for chiming in.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

fix-contrecord-condition-v2.patchapplication/octet-stream; name=fix-contrecord-condition-v2.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 537845cada..fbf2d34eef 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5433,6 +5433,14 @@ StartupXLOG(void)
 	 */
 	if (!XLogRecPtrIsInvalid(missingContrecPtr))
 	{
+		/*
+		 * We should only have a missingContrecPtr if we're not switching to
+		 * a new timeline. When a timeline switch occurs, WAL is copied from
+		 * the old timeline to the new only up to the end of the last complete
+		 * record, so there can't be an incomplete WAL record that we need to
+		 * disregard.
+		 */
+		Assert(newTLI == endOfRecoveryInfo->lastRecTLI);
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
 	}
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a59a0e826b..4ad145dd16 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3024,12 +3024,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		if (record == NULL)
 		{
 			/*
-			 * When not in standby mode we find that WAL ends in an incomplete
-			 * record, keep track of that record.  After recovery is done,
-			 * we'll write a record to indicate to downstream WAL readers that
-			 * that portion is to be ignored.
+			 * When we find that WAL ends in an incomplete record, keep track
+			 * of that record.  After recovery is done, we'll write a record to
+			 * indicate to downstream WAL readers that that portion is to be
+			 * ignored.
+			 *
+			 * However, when ArchiveRecoveryRequested = true, we're going to
+			 * switch to a new timeline at the end of recovery. We will only
+			 * copy WAL over to the new timeline up to the end of the last
+			 * complete record, so if we did this, we would later create an
+			 * overwrite contrecord in the wrong place, breaking everything.
 			 */
-			if (!StandbyMode &&
+			if (!ArchiveRecoveryRequested &&
 				!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
#11Imseih (AWS), Sami
simseih@amazon.com
In reply to: Robert Haas (#8)
Re: standby promotion can create unreadable WAL

I agree. Testing StandbyMode here seems bogus. I thought initially
that the test should perhaps be for InArchiveRecovery rather than
ArchiveRecoveryRequested, but I see that the code which switches to a
new timeline cares about ArchiveRecoveryRequested, so I think that is
the correct thing to test here as well.

Concretely, I propose the following patch.

This patch looks similar to the change suggested in
/messages/by-id/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3@amazon.com
to deal with panics after promoting a standby.

The difference is the patch tests !ArchiveRecoveryRequested instead
of !StandbyModeRequested as proposed in the mentioned thread.

Thanks
--
Sami Imseih
Amazon Web Services

#12Robert Haas
robertmhaas@gmail.com
In reply to: Imseih (AWS), Sami (#11)
Re: standby promotion can create unreadable WAL

On Fri, Aug 26, 2022 at 11:59 AM Imseih (AWS), Sami <simseih@amazon.com> wrote:

I agree. Testing StandbyMode here seems bogus. I thought initially
that the test should perhaps be for InArchiveRecovery rather than
ArchiveRecoveryRequested, but I see that the code which switches to a
new timeline cares about ArchiveRecoveryRequested, so I think that is
the correct thing to test here as well.

Concretely, I propose the following patch.

This patch looks similar to the change suggested in
/messages/by-id/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3@amazon.com
to deal with panics after promoting a standby.

The difference is the patch tests !ArchiveRecoveryRequested instead
of !StandbyModeRequested as proposed in the mentioned thread.

OK, I didn't realize this bug had been independently discovered and it
looks like I was even involved in the previous discussion. I just
totally forgot about it.

I think, however, that your fix is wrong and this one is right.
Fundamentally, the server is either in normal running, or crash
recovery, or archive recovery. Standby mode is just an optional
behavior of archive recovery, controlling whether or not we keep
retrying once the end of WAL is reached. But there's no reason why the
server should put the contrecord at a different location when recovery
ends depending on that retry behavior. The only thing that matters is
whether we're going to switch timelines.

--
Robert Haas
EDB: http://www.enterprisedb.com

#13Imseih (AWS), Sami
simseih@amazon.com
In reply to: Robert Haas (#12)
Re: standby promotion can create unreadable WAL

I think, however, that your fix is wrong and this one is right.
Fundamentally, the server is either in normal running, or crash
recovery, or archive recovery. Standby mode is just an optional
behavior of archive recovery

Good point. Thanks for clearing my understanding.

Thanks
--
Sami Imseih
Amazon Web Services

#14Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#10)
1 attachment(s)
Re: standby promotion can create unreadable WAL

On Fri, Aug 26, 2022 at 7:53 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Aug 26, 2022 at 10:06 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

There's a small typo in the comment: "When find that". I suppose that
was meant to be "When we find that". You end that para with "and thus
we should not do this", but that sounds like it wouldn't matter if we
did. Maybe "and thus doing this would be wrong, so skip it." or
something like that. (Perhaps be even more specific and say "if we did
this, we would later create an overwrite record in the wrong place,
breaking everything")

I think that saying that someone should not do something implies
pretty clearly that it would be bad if they did. But I have no problem
with your more specific language, and as a general rule, it's good to
be specific, so let's use that.

v2 attached.

The patch LGTM, this patch will apply on master and v15. PFA patch
for back branches.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix-contrecord-condition-v2_v14_v96.patchapplication/octet-stream; name=fix-contrecord-condition-v2_v14_v96.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b76451f168..994be2678f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4098,12 +4098,18 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 		if (record == NULL)
 		{
 			/*
-			 * When not in standby mode we find that WAL ends in an incomplete
-			 * record, keep track of that record.  After recovery is done,
-			 * we'll write a record to indicate downstream WAL readers that
-			 * that portion is to be ignored.
+			 * When we find that WAL ends in an incomplete record, keep track
+			 * of that record.  After recovery is done, we'll write a record to
+			 * indicate to downstream WAL readers that that portion is to be
+			 * ignored.
+			 *
+			 * However, when ArchiveRecoveryRequested = true, we're going to
+			 * switch to a new timeline at the end of recovery. We will only
+			 * copy WAL over to the new timeline up to the end of the last
+			 * complete record, so if we did this, we would later create an
+			 * overwrite contrecord in the wrong place, breaking everything.
 			 */
-			if (!StandbyMode &&
+			if (!ArchiveRecoveryRequested &&
 				!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
@@ -7387,6 +7393,14 @@ StartupXLOG(void)
 	 */
 	if (!XLogRecPtrIsInvalid(missingContrecPtr))
 	{
+		/*
+		 * We should only have a missingContrecPtr if we're not switching to
+		 * a new timeline. When a timeline switch occurs, WAL is copied from
+		 * the old timeline to the new only up to the end of the last complete
+		 * record, so there can't be an incomplete WAL record that we need to
+		 * disregard.
+		 */
+		Assert(ThisTimeLineID == PrevTimeLineID);
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
 	}
#15Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Dilip Kumar (#14)
1 attachment(s)
Re: standby promotion can create unreadable WAL

At Sun, 28 Aug 2022 10:16:21 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in

On Fri, Aug 26, 2022 at 7:53 PM Robert Haas <robertmhaas@gmail.com> wrote:

v2 attached.

The patch LGTM, this patch will apply on master and v15. PFA patch
for back branches.

StandbyMode is obviously wrong. On the other hand I thought that
!ArchiveRecoveryRequested is somewhat wrong, too (, as I stated in the
pointed thread). On second thought, I changed my mind that it is
right. After aborted contrec is found, The cause of the confusion is
that I somehow thought that archive recovery continues from the
aborted-contrec record. However, that assumption is wrong. The next
redo starts from the beginning of the aborted contrecord so we should
forget abouat the old missing/aborted contrec info when archive
recovery is requested.

In the end, the point is that we need to set the global variables only
when XLogPrefetcherReadRecord() (or XLogReadRecord()) returns NULL and
we return it to the caller. Is it worth to do a small refactoring
like the attached? If no, I'm fine with the proposed patch including
the added assertion.

# I havent reproduce the issue of the OP in the other thread yet, and
# also not found how to reproduce this issue, though..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

refactor_abrtcontrec_set.txttext/plain; charset=us-asciiDownload
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a59a0e826b..d354701423 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3023,19 +3023,6 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
 		{
-			/*
-			 * When not in standby mode we find that WAL ends in an incomplete
-			 * record, keep track of that record.  After recovery is done,
-			 * we'll write a record to indicate to downstream WAL readers that
-			 * that portion is to be ignored.
-			 */
-			if (!StandbyMode &&
-				!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
-			{
-				abortedRecPtr = xlogreader->abortedRecPtr;
-				missingContrecPtr = xlogreader->missingContrecPtr;
-			}
-
 			if (readFile >= 0)
 			{
 				close(readFile);
@@ -3125,8 +3112,20 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			/* In standby mode, loop back to retry. Otherwise, give up. */
 			if (StandbyMode && !CheckForStandbyTrigger())
 				continue;
-			else
-				return NULL;
+
+			/*
+			 * We return NULL to the caller.  If the last record has failed
+			 * with a missing contrecord, inform that to the caller as well.
+			 * In other cases we retry from the beginning of the failed record
+			 * so no need to do this.
+			 */
+			if (!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
+			{
+				abortedRecPtr = xlogreader->abortedRecPtr;
+				missingContrecPtr = xlogreader->missingContrecPtr;
+			}
+
+			return NULL;
 		}
 	}
 }
#16Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#15)
Re: standby promotion can create unreadable WAL

At Mon, 29 Aug 2022 13:13:52 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

we return it to the caller. Is it worth to do a small refactoring
like the attached? If no, I'm fine with the proposed patch including
the added assertion.

Mmm. That seems wrong. So forget about that. The proposed patch looks
fine to me.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#17Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#7)
1 attachment(s)
Re: standby promotion can create unreadable WAL

On Fri, Aug 26, 2022 at 6:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Aug 23, 2022 at 12:06 AM Robert Haas <robertmhaas@gmail.com> wrote:

However, if anything

did try to look at file #4 it would get confused. Maybe that can
happen if this is a streaming standby, where we only write an
end-of-recovery record upon promotion, rather than a checkpoint, or
maybe if there are cascading standbys someone could try to actually
use the 000000020000000000000004 file for something. I'm not sure. But
unless I'm missing something, that file is bogus, and our only hope of
not having problems is that perhaps no one will ever look at it.

I tried to see the problem with the cascading standby, basically the
setup is like below
pgprimary->pgstandby(archive only)->pgcascade(streaming + archive).

The second node has to be archive only because this 0 filled gap is
created in archive only mode. With that I have noticed that the when
cascading standby is getting that 0 filled gap it report same error
what we seen with pg_waldump and that it keep waiting forever on that
file. I have attached a test case, but I think timing is not done
perfectly in this test so before the cascading standby setup some of
the WAL file get removed by the pgstandby so I just put direct return
in RemoveOldXlogFiles() to test this[2]diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index eb5115f..990a879 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -3558,6 +3558,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr, XLogSegNo endlogSegNo; XLogSegNo recycleSegNo;. And this problem is getting
resolved with the patch given by Robert upthread.

[1]: 2022-08-25 16:21:26.413 IST [18235] LOG: invalid record length at 0/FFFFEA8: wanted 24, got 0
2022-08-25 16:21:26.413 IST [18235] LOG: invalid record length at
0/FFFFEA8: wanted 24, got 0

[2]
diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index eb5115f..990a879 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3558,6 +3558,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr
lastredoptr, XLogRecPtr endptr,
        XLogSegNo       endlogSegNo;
        XLogSegNo       recycleSegNo;

+ return;
/* Initialize info about where to try to recycle to */
XLByteToSeg(endptr, endlogSegNo, wal_segment_size);
recycleSegNo = XLOGfileslop(lastredoptr);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

test_contrewrite_cascade.shtext/x-sh; charset=US-ASCII; name=test_contrewrite_cascade.shDownload
#18Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#16)
Re: standby promotion can create unreadable WAL

On Mon, Aug 29, 2022 at 12:21 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Mmm. That seems wrong. So forget about that. The proposed patch looks
fine to me.

Thanks for thinking it over. Committed and back-patched as far as v10,
since that's the oldest supported release.

--
Robert Haas
EDB: http://www.enterprisedb.com