Possible crash on standby

Started by Kyotaro Horiguchi over 3 years ago, 4 messages
#1Kyotaro Horiguchi
horikyota.ntt@gmail.com
1 attachment(s)

Hello.

While playing with a patch, I hit an assertion failure.

#2 0x0000000000b350e0 in ExceptionalCondition (
conditionName=0xbd8970 "!IsInstallXLogFileSegmentActive()",
errorType=0xbd6e11 "FailedAssertion", fileName=0xbd6f28 "xlogrecovery.c",
lineNumber=4190) at assert.c:69
#3 0x0000000000586f9c in XLogFileRead (segno=61, emode=13, tli=1,
source=XLOG_FROM_ARCHIVE, notfoundOk=true) at xlogrecovery.c:4190
#4 0x00000000005871d2 in XLogFileReadAnyTLI (segno=61, emode=13,
source=XLOG_FROM_ANY) at xlogrecovery.c:4296
#5 0x000000000058656f in WaitForWALToBecomeAvailable (RecPtr=1023410360,
randAccess=false, fetching_ckpt=false, tliRecPtr=1023410336, replayTLI=1,
replayLSN=1023410336, nonblocking=false) at xlogrecovery.c:3727

This is reproducible with the following steps.

1. insert a sleep(1) in WaitForWALToBecomeAvailable().

     * WAL that we restore from archive.
     */
+   sleep(1);
    if (WalRcvStreaming())
        XLogShutdownWalRcv();

2. create a primary with archiving enabled.

3. create a standby that recovers from the primary's archive and has
an unconnectable primary_conninfo.

4. start the primary.

5. switch wal on the primary.

6. Kaboom.

This is because WaitForWALToBecomeAvailable() doesn't call
XLogShutdownWalRcv() when walreceiver has already stopped before we
reach the WalRcvStreaming() call cited above. But we need to set
InstallXLogFileSegmentActive to false even in that case, since no one
other than the startup process does that.
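
A minimal sketch of the state mismatch described above, written as a toy model rather than PostgreSQL code. All names below (the two static variables and the model_* functions) are hypothetical stand-ins; the assumption that the startup process raises the flag when it requests streaming is mine, while the rest follows the description in this message.

```c
#include <stdbool.h>

/* Toy stand-ins for shared state; hypothetical names, not PostgreSQL's. */
static bool install_segment_active;  /* models InstallXLogFileSegmentActive */
static bool walreceiver_streaming;   /* models what WalRcvStreaming() reports */

/* Assumption: the startup process raises the flag when it asks for streaming. */
static void model_request_streaming(void)
{
    install_segment_active = true;
    walreceiver_streaming = true;
}

/* walreceiver goes away by itself (e.g. cannot connect); the flag is untouched. */
static void model_walreceiver_exits(void)
{
    walreceiver_streaming = false;
}

/* Models XLogShutdownWalRcv(): stops walreceiver and clears the flag.
 * Harmless when walreceiver has already stopped. */
static void model_shutdown_walrcv(void)
{
    walreceiver_streaming = false;
    install_segment_active = false;
}

/* Models the switch to archive recovery. Returns true when the condition
 * checked by the failing assertion (flag is clear) would hold. */
static bool model_switch_to_archive(bool unconditional)
{
    if (unconditional || walreceiver_streaming)
        model_shutdown_walrcv();
    return !install_segment_active;
}
```

In this model, calling model_request_streaming() and model_walreceiver_exits() and then model_switch_to_archive(false) returns false, which corresponds to the assertion failure; the same sequence with model_switch_to_archive(true) returns true.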

Unconditionally calling XLogShutdownWalRcv() fixes it. I feel we might
need to correct the dependency between the flag and the walreceiver
state, but that is not mandatory, because XLogShutdownWalRcv() is
designed so that it can be called even after walreceiver has stopped.
I don't clearly remember why we did it that way at the time, but the
recovery check runs successfully with this change.

This code was introduced in PG12.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v1-0001-Do-not-skip-calling-XLogShutdownWalRcv.patch (text/x-patch; charset=us-ascii)
From 12100bb9c041c660eaad675d8e6ed69b49270009 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 9 Sep 2022 17:06:47 +0900
Subject: [PATCH v1] Do not skip calling XLogShutdownWalRcv()

XLogShutdownWalRcv() not only stops wal receiver but also flips down the
InstallXLogFileSegmentActive flag.  The function is not called when
entering archive recovery mode with walreceiver already stopped. Thus
archive recovery afterwards faces an assertion failure on the flag in
that case.

Fix this by always calling the function to keep
InstallXLogFileSegmentActive consistent.
---
 src/backend/access/transam/xlogrecovery.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 9a80084a68..20bd1f1d2e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3532,8 +3532,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * walreceiver is not active, so that it won't overwrite
 					 * WAL that we restore from archive.
 					 */
-					if (WalRcvStreaming())
-						XLogShutdownWalRcv();
+					XLogShutdownWalRcv();
 
 					/*
 					 * Before we sleep, re-scan for possible new timelines if
-- 
2.31.1

#2Nathan Bossart
nathandbossart@gmail.com
In reply to: Kyotaro Horiguchi (#1)
Re: Possible crash on standby

On Fri, Sep 09, 2022 at 05:29:49PM +0900, Kyotaro Horiguchi wrote:

This is because WaitForWALToBecomeAvailable() doesn't call
XLogShutdownWalRcv() when walreceiver has already stopped before we
reach the WalRcvStreaming() call cited above. But we need to set
InstallXLogFileSegmentActive to false even in that case, since no one
other than the startup process does that.

Nice find.

Unconditionally calling XLogShutdownWalRcv() fixes it. I feel we might
need to correct the dependency between the flag and the walreceiver
state, but that is not mandatory, because XLogShutdownWalRcv() is
designed so that it can be called even after walreceiver has stopped.
I don't clearly remember why we did it that way at the time, but the
recovery check runs successfully with this change.

I suppose the alternative would be to set InstallXLogFileSegmentActive to
false in an 'else' block, but that doesn't seem necessary if
XLogShutdownWalRcv() is safe to call unconditionally. So, unless there is
a bigger problem that I'm not seeing, +1 for your patch.
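
For comparison, the two options discussed here can be sketched side by side as a toy model. This is only an illustration under stated assumptions: struct toy_state and the toy_* functions are hypothetical names, and the model assumes that XLogShutdownWalRcv() both stops walreceiver and clears the flag, as described earlier in the thread.

```c
#include <stdbool.h>

/* Hypothetical toy state, not PostgreSQL's real shared memory. */
struct toy_state
{
    bool install_segment_active;  /* models InstallXLogFileSegmentActive */
    bool walreceiver_streaming;   /* models what WalRcvStreaming() reports */
};

/* Models XLogShutdownWalRcv(): stop walreceiver and clear the flag. */
static void toy_shutdown_walrcv(struct toy_state *s)
{
    s->walreceiver_streaming = false;
    s->install_segment_active = false;
}

/* Variant A: the proposed patch, always call the shutdown function. */
static void toy_fix_unconditional(struct toy_state *s)
{
    toy_shutdown_walrcv(s);
}

/* Variant B: keep the check, but clear the flag in an 'else' block. */
static void toy_fix_else_branch(struct toy_state *s)
{
    if (s->walreceiver_streaming)
        toy_shutdown_walrcv(s);
    else
        s->install_segment_active = false;
}
```

In this model both variants leave the flag cleared and walreceiver stopped from every starting state, so the difference is only in how much code is needed.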

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#3Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Kyotaro Horiguchi (#1)
Re: Possible crash on standby

I think it is a duplicate of [1]. I have tested the above use-case
with the patch at [1] and it fixes the issue.

[1]: /messages/by-id/CALj2ACXPn_xePphnh88qmoQqqW+E2KEOdxGL+D-o9o7_XNGkkw@mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#4Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#3)
Re: Possible crash on standby

On Fri, Sep 09, 2022 at 10:51:10PM +0530, Bharath Rupireddy wrote:

I think it is a duplicate of [1]. I have tested the above use-case
with the patch at [1] and it fixes the issue.

I added this thread to the existing commitfest entry. Thanks for pointing
this out.

https://commitfest.postgresql.org/39/3814

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com