BUG #16817: kill process cause postmaster hang

Started by PG Bug reporting form, over 5 years ago, 5 messages, bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 16817
Logged by: Bo Chen
Email address: bchen90@163.com
PostgreSQL version: 11.8
Operating system: euleros v2r7 x86_64
Description:

Hi hackers

Recently we encountered a problem: after killing the walwriter, we expected
the database to recover normally, but it did not (the postmaster hung in the
'wait dead end' state, and the archiver did not exit).
After analyzing the problem, we found that it could be a long-standing bug.
The archiver uses 'system' to call the configured archive command. For
'system', the Linux Programmer's Manual says: 'During
execution of the command, SIGCHLD will be blocked, and SIGINT and SIGQUIT
will be ignored'.

So, when a child crashes, we send SIGQUIT to the archiver only once. If the
archiver is executing 'system' at that moment, SIGQUIT is ignored, and the
postmaster hangs in the 'wait dead end' state.
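
The behavior described above can be reproduced outside PostgreSQL. The
following is a minimal sketch, not PostgreSQL code: it assumes Linux and
relies on Python's os.system being a thin wrapper around the C system()
call. The function name sigquit_outcome and the 'sleep 1' command are
made up for illustration; the forked child only stands in for the archiver.

```python
import os
import signal
import time


def sigquit_outcome(in_system: bool) -> int:
    """Fork a child, send it SIGQUIT once, and report how it ended.

    Returns the number of the signal that terminated the child,
    or 0 if the child exited normally.
    """
    pid = os.fork()
    if pid == 0:
        # Child: hypothetical stand-in for the archiver.
        if in_system:
            # While system() runs, the caller ignores SIGINT and SIGQUIT.
            os.system("sleep 1")
        else:
            # Plain sleep: SIGQUIT keeps its default (fatal) disposition.
            time.sleep(1)
        os._exit(0)

    # Parent: give the child time to enter system(), then signal it once.
    time.sleep(0.3)
    os.kill(pid, signal.SIGQUIT)
    _, status = os.waitpid(pid, 0)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else 0


if __name__ == "__main__":
    print("inside system():", sigquit_outcome(True))
    print("outside system():", sigquit_outcome(False))
```

On a Linux system one would expect the first call to report a normal exit
(the single SIGQUIT was discarded) and the second to report termination by
SIGQUIT, which matches the reporter's analysis of why the archiver survives.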

To work around this problem, we now send SIGUSR2 to the archiver after
SIGQUIT in HandleChildCrash. Is there any other solution?

regards, ChenBo

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: PG Bug reporting form (#1)
Re: BUG #16817: kill process cause postmaster hang

PG Bug reporting form <noreply@postgresql.org> writes:

> Recently we encountered a problem: after killing the walwriter, we expected
> the database to recover normally, but it did not (the postmaster hung in the
> 'wait dead end' state, and the archiver did not exit).
> After analyzing the problem, we found that it could be a long-standing bug.
> The archiver uses 'system' to call the configured archive command. For
> 'system', the Linux Programmer's Manual says: 'During
> execution of the command, SIGCHLD will be blocked, and SIGINT and SIGQUIT
> will be ignored'.
>
> So, when a child crashes, we send SIGQUIT to the archiver only once. If the
> archiver is executing 'system' at that moment, SIGQUIT is ignored, and the
> postmaster hangs in the 'wait dead end' state.

Not sure I believe this: why wouldn't the SIGKILL-after-5-seconds logic
get us out of that situation?

regards, tom lane

#3 bchen90
bchen90@163.com
In reply to: Tom Lane (#2)
Re: BUG #16817: kill process cause postmaster hang

Hi Tom,

Thanks for your reply. Can you elaborate on the "SIGKILL-after-5-seconds
logic"?

regards, chenbo


#4 Andy Fan
zhihui.fan1213@gmail.com
In reply to: bchen90 (#3)
Re: BUG #16817: kill process cause postmaster hang

On Mon, Jan 25, 2021 at 9:01 AM bchen90 <bchen90@163.com> wrote:

> Hi Tom,
>
> Thanks for your reply. Can you elaborate on the "SIGKILL-after-5-seconds
> logic"?
>
> regards, chenbo

82233ce7ea42d6ba519aaec63008aff49da6c7af should be the commit Tom was
talking about.

commit 82233ce7ea42d6ba519aaec63008aff49da6c7af
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri Jun 28 17:20:53 2013 -0400

Send SIGKILL to children if they don't die quickly in immediate shutdown

On immediate shutdown, or during a restart-after-crash sequence,
postmaster used to send SIGQUIT (and then abandon ship if shutdown); but
this is not a good strategy if backends don't die because of that
signal. (This might happen, for example, if a backend gets tangled
trying to malloc() due to gettext(), as in an example illustrated by
MauMau.) This causes problems when later trying to restart the server,
because some processes are still attached to the shared memory segment.

Instead of just abandoning such backends to their fates, we now have
postmaster hang around for a little while longer, send a SIGKILL after
some reasonable waiting period, and then exit. This makes immediate
shutdown more reliable.

There is disagreement on whether it's best for postmaster to exit after
sending SIGKILL, or to stick around until all children have reported
death. If this controversy is resolved differently than what this patch
implements, it's an easy change to make.

Bug reported by MauMau in message
20DAEA8949EC4E2289C6E8E58560DEC0@maumau

MauMau and Álvaro Herrera
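
The escalation that the commit message describes (SIGQUIT first, SIGKILL if
the child is still alive after a waiting period) can be sketched as follows.
This is only an illustration under stated assumptions, not the postmaster's
actual code: the helper names spawn_child_in_system and quit_then_kill, the
'sleep 2' command, and the short grace period are all hypothetical, and the
sketch assumes Linux, where os.system wraps the C system() call.

```python
import os
import signal
import time


def spawn_child_in_system(cmd: str) -> int:
    """Fork a child that blocks inside system(); return its pid."""
    pid = os.fork()
    if pid == 0:
        os.system(cmd)  # SIGQUIT is ignored by this process meanwhile
        os._exit(0)
    time.sleep(0.3)  # give the child time to enter system()
    return pid


def quit_then_kill(pid: int, grace_secs: float) -> int:
    """Send SIGQUIT, wait up to grace_secs, then fall back to SIGKILL.

    Returns the signal that terminated the child, or 0 on a normal exit.
    """
    os.kill(pid, signal.SIGQUIT)
    deadline = time.monotonic() + grace_secs
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return os.WTERMSIG(status) if os.WIFSIGNALED(status) else 0
        time.sleep(0.05)
    # Grace period expired: SIGKILL cannot be ignored or blocked.
    os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else 0


if __name__ == "__main__":
    child = spawn_child_in_system("sleep 2")
    print("terminated by signal", quit_then_kill(child, 0.5))
```

A child that is stuck inside system() ignores the SIGQUIT but cannot ignore
the later SIGKILL, which is why a timeout of this kind would be expected to
clear the hang described at the start of the thread.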

--
Best Regards
Andy Fan (https://www.aliyun.com/)

#5 Michael Paquier
michael@paquier.xyz
In reply to: bchen90 (#3)
Re: BUG #16817: kill process cause postmaster hang

On Sun, Jan 24, 2021 at 06:01:04PM -0700, bchen90 wrote:

> Thanks for your reply. Can you elaborate on the "SIGKILL-after-5-seconds
> logic"?

You are looking for the changes related to this command, as of
postmaster.c:
git grep SIGKILL_CHILDREN_AFTER_SECS
--
Michael