Infinite loop on master shutdown
Hello, as in pgsql-bug ML.
/messages/by-id/20180517.170021.24356216.horiguchi.kyotaro@lab.ntt.co.jp
Master can go into infinite loop on shutdown. But it is caused by
a broken database like storage rolled-back one. (The steps to
replay this is shown in the above mail.)
I think this can be avoided by rejecting a standby if it reports
that write LSN is smaller than flush LSN after catching up.
Is it worth fixing?
# The patch is slightly different from that I posted to -bugs.
It is enough to chek for the invalid state just once but the
patch continues the check.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
reject_invalid_standby.patchtext/x-patch; charset=us-asciiDownload
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e47ddca6bc..1916acf629 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1809,6 +1809,19 @@ ProcessStandbyReplyMessage(void)
if (replyRequested)
WalSndKeepalive(false);
+ /*
+ * Once this stream catches up to WAL, writePtr cannot be smaller than
+ * flushPtr. Otherwise we haven't reached the standby's starting LSN thus
+ * this database cannot be a proper master of the standby. The state
+ * causes infinite loop on shutdown.
+ */
+ if (MyWalSnd->state >= WALSNDSTATE_CATCHUP &&
+ writePtr != InvalidXLogRecPtr && writePtr < flushPtr)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("Standby started from the future LSN for this server"),
+ errhint("This means that the standby is not created from this database.")));
+
/*
* Update shared state for this WalSender process based on reply data from
* standby.
Hi,
On 2018-05-17 17:19:00 +0900, Kyotaro HORIGUCHI wrote:
Hello, as in pgsql-bug ML.
/messages/by-id/20180517.170021.24356216.horiguchi.kyotaro@lab.ntt.co.jp
Master can go into infinite loop on shutdown. But it is caused by
a broken database like storage rolled-back one. (The steps to
replay this is shown in the above mail.)I think this can be avoided by rejecting a standby if it reports
that write LSN is smaller than flush LSN after catching up.Is it worth fixing?
I'm very doubtful. If you do bad stuff to a standby, bad things can
happen...
Greetings,
Andres Freund
At Thu, 17 May 2018 09:20:01 -0700, Andres Freund <andres@anarazel.de> wrote in <20180517162001.rzd7l6g2h66hvzvd@alap3.anarazel.de>
Hi,
On 2018-05-17 17:19:00 +0900, Kyotaro HORIGUCHI wrote:
Hello, as in pgsql-bug ML.
/messages/by-id/20180517.170021.24356216.horiguchi.kyotaro@lab.ntt.co.jp
Master can go into infinite loop on shutdown. But it is caused by
a broken database like storage rolled-back one. (The steps to
replay this is shown in the above mail.)I think this can be avoided by rejecting a standby if it reports
that write LSN is smaller than flush LSN after catching up.Is it worth fixing?
I'm very doubtful. If you do bad stuff to a standby, bad things can
happen...
Yes, I doubted its worthiness since I didn't find more natural
way to cause that.
Thanks for the opinion.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center