pg_promote() can cause busy loop

Started by Fujii Masaoover 6 years ago7 messageshackers
Jump to latest
#1Fujii Masao
masao.fujii@gmail.com

Hi,

I found small issue in pg_promote(). If postmaster dies
while pg_promote() is waiting for the standby promotion to finish,
pg_promote() can cause busy loop. This happens because
pg_promote() does nothing when WaitLatch() detects
the postmaster death event. I think that pg_promote()
should bail out of the loop immediately in that case.

Attached is the patch for the fix.

Regards,

--
Fujii Masao

Attachments:

pg_promote_and_postmaster_death.patchapplication/octet-stream; name=pg_promote_and_postmaster_death.patchDownload+22-20
#2Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#1)
Re: pg_promote() can cause busy loop

On Thu, Sep 05, 2019 at 09:46:26AM +0900, Fujii Masao wrote:

I found small issue in pg_promote(). If postmaster dies
while pg_promote() is waiting for the standby promotion to finish,
pg_promote() can cause busy loop. This happens because
pg_promote() does nothing when WaitLatch() detects
the postmaster death event. I think that pg_promote()
should bail out of the loop immediately in that case.

Attached is the patch for the fix.

Indeed, this is not correct.

-   ereport(WARNING,
-           (errmsg("server did not promote within %d seconds",
-           wait_seconds)));
+   if (i >= WAITS_PER_SECOND * wait_seconds)
+       ereport(WARNING,
+               (errmsg("server did not promote within %d seconds", wait_seconds)));

Would it make more sense to issue a warning mentioning the postmaster
death and then return PG_RETURN_BOOL(false) instead of breaking out of
the loop? It could be confusing to warn about a timeout if the
postmaster died in parallel, and we know the actual reason why the
promotion did not happen in this case.
--
Michael

#3Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#2)
Re: pg_promote() can cause busy loop

On Thu, Sep 5, 2019 at 10:26 AM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Sep 05, 2019 at 09:46:26AM +0900, Fujii Masao wrote:

I found small issue in pg_promote(). If postmaster dies
while pg_promote() is waiting for the standby promotion to finish,
pg_promote() can cause busy loop. This happens because
pg_promote() does nothing when WaitLatch() detects
the postmaster death event. I think that pg_promote()
should bail out of the loop immediately in that case.

Attached is the patch for the fix.

Indeed, this is not correct.

-   ereport(WARNING,
-           (errmsg("server did not promote within %d seconds",
-           wait_seconds)));
+   if (i >= WAITS_PER_SECOND * wait_seconds)
+       ereport(WARNING,
+               (errmsg("server did not promote within %d seconds", wait_seconds)));

Would it make more sense to issue a warning mentioning the postmaster
death and then return PG_RETURN_BOOL(false) instead of breaking out of
the loop? It could be confusing to warn about a timeout if the
postmaster died in parallel, and we know the actual reason why the
promotion did not happen in this case.

It's ok to use PG_RETURN_BOOL(false) instead of breaking out of the loop
in that case. Which would make the code simpler.

But I don't think it's worth warning about postmaster death here
because a backend will emit FATAL message like "terminating connection
due to unexpected postmaster exit" in secure_read() after
pg_promote() returns false.

Regards,

--
Fujii Masao

#4Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#3)
Re: pg_promote() can cause busy loop

On Thu, Sep 05, 2019 at 10:53:19AM +0900, Fujii Masao wrote:

It's ok to use PG_RETURN_BOOL(false) instead of breaking out of the loop
in that case. Which would make the code simpler.

Okay. I would have done so FWIW.

But I don't think it's worth warning about postmaster death here
because a backend will emit FATAL message like "terminating connection
due to unexpected postmaster exit" in secure_read() after
pg_promote() returns false.

Good point, that could be equally confusing.
--
Michael

#5Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#4)
Re: pg_promote() can cause busy loop

On Thu, Sep 5, 2019 at 11:10 AM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Sep 05, 2019 at 10:53:19AM +0900, Fujii Masao wrote:

It's ok to use PG_RETURN_BOOL(false) instead of breaking out of the loop
in that case. Which would make the code simpler.

Okay. I would have done so FWIW.

But I don't think it's worth warning about postmaster death here
because a backend will emit FATAL message like "terminating connection
due to unexpected postmaster exit" in secure_read() after
pg_promote() returns false.

Good point, that could be equally confusing.

So, barring any objection, I will commit the attached patch.

Regards,

--
Fujii Masao

Attachments:

pg_promote_and_postmaster_death_v2.patchapplication/octet-stream; name=pg_promote_and_postmaster_death_v2.patchDownload+17-15
#6Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#5)
Re: pg_promote() can cause busy loop

On Thu, Sep 05, 2019 at 04:03:22PM +0900, Fujii Masao wrote:

So, barring any objection, I will commit the attached patch.

LGTM. Thanks!
--
Michael

#7Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#6)
Re: pg_promote() can cause busy loop

On Thu, Sep 5, 2019 at 4:52 PM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Sep 05, 2019 at 04:03:22PM +0900, Fujii Masao wrote:

So, barring any objection, I will commit the attached patch.

LGTM. Thanks!

Committed. Thanks!

Regards,

--
Fujii Masao