[RFC] Should we fix postmaster to avoid slow shutdown?

Started by Tsunakawa, Takayukiover 9 years ago60 messageshackers
Jump to latest
#1Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com

Hello,

Please let me ask you about possible causes of a certain problem, slow shutdown of postmaster when a backend crashes, and whether to fix PostgreSQL.

Our customer is using 64-bit PostgreSQL 9.2.8 on RHEL 6.4. Yes, the PostgreSQL version is rather old but there's no relevant bug fix in later 9.2.x releases.

PROBLEM
==============================

One backend process (postgres) for an application session crashed due to a segmentation fault and dumped a core file. The cause is a bug of pg_dbms_stats. Another note is that restart_after_crash is off to make failover happen.

The problem here is that postmaster took as long as 15 seconds to terminate after it had detected a crashed backend. The messages were output as follows:

20:12:35.004に
LOG: server process (PID 31894) was terminated by signal 11: Segmentation fault
DETAIL: Failed process was running: DELETE...(snip)
LOG: terminating any other active server processes

From 20:12:35.013 to 20:12:39.074, the following message was output 80 times.

FATAL: the database system is in recovery mode

20:12:50
The custom monitoring system detected the death of postmaster as a result of running "pg_ctl status".

That's it. You may say the following message should also have been emitted, but there's not. This is because we commented out the ereport() call in quickdie() in tcop.c. That ereport() call can hang depending on the timing, which is fixed in 9.4.

WARNING: terminating connection because of crash of another server process

The customer insists that PostgreSQL takes longer to shut down than expected, which risks exceeding their allowed failover time.

CAUSE
==============================

There's no apparent evidence to indicate the cause, but I could guess a few reasons. What do you think these are correct and should fix PostgreSQL? (I think so)

1) postmaster should close the listening ports earlier
As cited above, for 4 seconds, postmaster created 80 dead-end child processes which just output "FATAL: the database system is in recovery mode". This indicates that postmaster is busy handling re-connection requests from disconnected applications, preventing postmaster from reaping dead children as fast as possible. This is a waste because postmaster will only shut down.

I think the listening ports should be closed in HandleChildCrash() when the condition "(RecoveryError || !restart_after_crash)" is true.

2) make stats collector terminate immediately
stats collector seems to write the permanent stats file even when it receives SIGQUIT. But it's useless because the stat file is reset during recovery. And Tom claimed that writing stats file can take long:

/messages/by-id/11800.1455135203@sss.pgh.pa.us

3) Anything else?
While postmaster is in PM_WAIT_DEAD_END state, it leaves the listening ports open but doesn't call select()/accept(). As a result, incoming connection requests are accumulated in the listen queue of the sockets. Does the OS have any bug to slow the process termination when the listen queue is not empty?

Regards
Takayuki Tsunakawa

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Robert Haas
robertmhaas@gmail.com
In reply to: Tsunakawa, Takayuki (#1)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

On Tue, Sep 20, 2016 at 2:20 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

There's no apparent evidence to indicate the cause, but I could guess a few reasons. What do you think these are correct and should fix PostgreSQL? (I think so)

I think that we shouldn't start changing things based on guesses about
what the problem is, even if they're fairly smart guesses. The thing
to do would be to construct a test rig, crash the server repeatedly,
and add debugging instrumentation to figure out where the time is
actually going.

I do think your theory about the stats collector might be worth
pursuing. It seems that the stats collector only responds to SIGQUIT,
ignoring SIGTERM. Making it do a clean shutdown on SIGTERM and a fast
exit on SIGQUIT seems possibly worthwhile.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#2)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
On Tue, Sep 20, 2016 at 2:20 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

There's no apparent evidence to indicate the cause, but I could guess
a few reasons. What do you think these are correct and should fix
PostgreSQL? (I think so)

I think that we shouldn't start changing things based on guesses about what
the problem is, even if they're fairly smart guesses. The thing to do would
be to construct a test rig, crash the server repeatedly, and add debugging
instrumentation to figure out where the time is actually going.

We have tried to reproduce the problem in the past several days with much more stress on our environment than on the customer's one -- 1,000 tables aiming for a dozens of times larger stats file and repeated reconnection requests from hundreds of clients -- but we could not succeed.

I do think your theory about the stats collector might be worth pursuing.
It seems that the stats collector only responds to SIGQUIT, ignoring SIGTERM.
Making it do a clean shutdown on SIGTERM and a fast exit on SIGQUIT seems
possibly worthwhile.

Thank you for giving confidence for proceeding. And I also believe that postmaster should close the listening ports earlier. Regardless of whether this problem will be solved not confident these will solve the, I think it'd be better to fix these two points so that postmaster doesn't longer time than necessary. I think I'll create a patch after giving it a bit more thought.

Regards
Takayuki Tsunakawa

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tsunakawa, Takayuki (#3)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
I think that we shouldn't start changing things based on guesses about what
the problem is, even if they're fairly smart guesses. The thing to do would
be to construct a test rig, crash the server repeatedly, and add debugging
instrumentation to figure out where the time is actually going.

We have tried to reproduce the problem in the past several days with much more stress on our environment than on the customer's one -- 1,000 tables aiming for a dozens of times larger stats file and repeated reconnection requests from hundreds of clients -- but we could not succeed.

I do think your theory about the stats collector might be worth pursuing.
It seems that the stats collector only responds to SIGQUIT, ignoring SIGTERM.
Making it do a clean shutdown on SIGTERM and a fast exit on SIGQUIT seems
possibly worthwhile.

Thank you for giving confidence for proceeding. And I also believe that postmaster should close the listening ports earlier. Regardless of whether this problem will be solved not confident these will solve the, I think it'd be better to fix these two points so that postmaster doesn't longer time than necessary. I think I'll create a patch after giving it a bit more thought.

FWIW, I'm pretty much -1 on messing with the timing of the socket close
actions. I broke that once within recent memory, so maybe I'm gun-shy,
but I think that the odds of unpleasant side effects greatly outweigh
any likely benefit there.

Allowing SIGQUIT to prompt fast shutdown of the stats collector seems
sane, though. Try to make sure it doesn't leave partly-written stats
files behind.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Tom Lane (#4)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
Allowing SIGQUIT to prompt fast shutdown of the stats collector seems sane,
though. Try to make sure it doesn't leave partly-written stats files
behind.

The attached patch based on HEAD does this. I'd like this to be back-patched because one of our important customers uses 9.2.

I didn't remove partially written stat files on SIGQUIT for the following reasons. Is this OK?

1. The recovery at the next restart will remove the stat files.
2. SIGQUIT processing should be as fast as possible.
3. If writing stats files took long due to the OS page cache flushing, removing files might be forced to wait likewise.

FWIW, I'm pretty much -1 on messing with the timing of the socket close
actions. I broke that once within recent memory, so maybe I'm gun-shy,
but I think that the odds of unpleasant side effects greatly outweigh any
likely benefit there.

Wasn't it related to TouchSocketFiles()? Can I see the discussion on this ML? I don't see any problem looking at the code...

Regards
Takayuki Tsunakawa

Attachments:

01_pgstat_avoid_writing_on_sigquit.patchapplication/octet-stream; name=01_pgstat_avoid_writing_on_sigquit.patchDownload+23-14
#6Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Tom Lane (#4)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane

FWIW, I'm pretty much -1 on messing with the timing of the socket close

actions. I broke that once within recent memory, so maybe I'm gun-shy,
but I think that the odds of unpleasant side effects greatly outweigh any
likely benefit there.

I couldn't find any relevant mails in pgsql-hackers. I found no problem with the attached patch. Do you think this is OK?

Regards
Takayuki Tsunakawa

Attachments:

02_close_listen_ports_early.patchapplication/octet-stream; name=02_close_listen_ports_early.patchDownload+48-13
#7Robert Haas
robertmhaas@gmail.com
In reply to: Tsunakawa, Takayuki (#6)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

On Tue, Oct 4, 2016 at 5:05 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane

FWIW, I'm pretty much -1 on messing with the timing of the socket close

actions. I broke that once within recent memory, so maybe I'm gun-shy,
but I think that the odds of unpleasant side effects greatly outweigh any
likely benefit there.

I couldn't find any relevant mails in pgsql-hackers. I found no problem with the attached patch. Do you think this is OK?

I have no opinion on this patch, because I haven't reviewed it, but
note recent commit 3b90e38c5d592ea8ec8236287dd5c749fc041728, which
appears to be semi-related.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#7)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

From: Robert Haas [mailto:robertmhaas@gmail.com]

I have no opinion on this patch, because I haven't reviewed it, but note
recent commit 3b90e38c5d592ea8ec8236287dd5c749fc041728, which appears to
be semi-related.

Thank you for interesting information. Maybe Tom-san experienced some trouble in creating this patch. Fortunately, this doesn't appear to be related to my patch, because the patch changed the timing of closing listen ports in postmaster children, whereas my patch explicitly closes listen ports in postmaster.

Regards
Takayuki Tsunakawa

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Tsunakawa, Takayuki (#5)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

On Wed, Sep 28, 2016 at 8:37 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
Allowing SIGQUIT to prompt fast shutdown of the stats collector seems sane,
though. Try to make sure it doesn't leave partly-written stats files
behind.

The attached patch based on HEAD does this. I'd like this to be back-patched because one of our important customers uses 9.2.

I didn't remove partially written stat files on SIGQUIT for the following reasons. Is this OK?

1. The recovery at the next restart will remove the stat files.
2. SIGQUIT processing should be as fast as possible.
3. If writing stats files took long due to the OS page cache flushing, removing files might be forced to wait likewise.

I agree with the first point.

The patch applies and compiles clean. make check-world is clean.

In pgstat_quickdie(), I think a call to sigaddset(&BlockSig, SIGQUIT) is
missing before PG_SETMASK(). Although there are some SIGQUIT handlers which do
not have that call. But I guess, it will be safer to have it.

Also, many other SIGQUIT handlers like bgworker_quickdie() call on_exit_reset()
followed by exit(2) instead of just exit(1) in pgstat_quickdie(). Why is this
difference?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#9)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

On Wed, Oct 26, 2016 at 7:12 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Also, many other SIGQUIT handlers like bgworker_quickdie() call on_exit_reset()
followed by exit(2) instead of just exit(1) in pgstat_quickdie(). Why is this
difference?

Well, for that, you'd need to look at how postmaster.c treats those
exit codes. exit(2) from a regular backend or background worker will
cause a crash-and-restart cycle; I'm not sure whether the handling for
the stats collector is similar or different.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#10)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Oct 26, 2016 at 7:12 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Also, many other SIGQUIT handlers like bgworker_quickdie() call on_exit_reset()
followed by exit(2) instead of just exit(1) in pgstat_quickdie(). Why is this
difference?

Well, for that, you'd need to look at how postmaster.c treats those
exit codes. exit(2) from a regular backend or background worker will
cause a crash-and-restart cycle; I'm not sure whether the handling for
the stats collector is similar or different.

I'm fairly sure it's different --- there are different policies for
child processes that aren't connected to shared memory (such as the
stats collector) because we shouldn't force a system-wide restart
when they crash.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Ashutosh Bapat (#9)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ashutosh Bapat
In pgstat_quickdie(), I think a call to sigaddset(&BlockSig, SIGQUIT) is
missing before PG_SETMASK(). Although there are some SIGQUIT handlers which
do not have that call. But I guess, it will be safer to have it.

I didn't add it because pgstat_quickdie() just exits, like some other postmaster children. I thought those processes which are concerned about their termination processing call sigaddset(SIGQUIT), so I went after the processes who aren't. Is this really necessary?

Also, many other SIGQUIT handlers like bgworker_quickdie() call
on_exit_reset() followed by exit(2) instead of just exit(1) in
pgstat_quickdie(). Why is this difference?

As Robert and Tom said, either exit(1) or exit(2) is OK because reaper() handles non-zero exit code the same. Regarding on_proc_reset(), stats collector is not attached to the shared memory and does not register on_proc_exit() callbacks. These situations are the same as the archiver process, so I followed it.

Regards
Takayuki Tsunakawa

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Tsunakawa, Takayuki (#12)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

On Thu, Oct 27, 2016 at 7:29 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ashutosh Bapat
In pgstat_quickdie(), I think a call to sigaddset(&BlockSig, SIGQUIT) is
missing before PG_SETMASK(). Although there are some SIGQUIT handlers which
do not have that call. But I guess, it will be safer to have it.

I didn't add it because pgstat_quickdie() just exits, like some other postmaster children. I thought those processes which are concerned about their termination processing call sigaddset(SIGQUIT), so I went after the processes who aren't. Is this really necessary?

Ok. In that case, I think we shouldn't even call PG_SETMASK() similar
to pgarch_exit(). Attached patch removes PG_SETMASK(). Let me know if
it looks good.

Also, many other SIGQUIT handlers like bgworker_quickdie() call
on_exit_reset() followed by exit(2) instead of just exit(1) in
pgstat_quickdie(). Why is this difference?

As Robert and Tom said, either exit(1) or exit(2) is OK because reaper() handles non-zero exit code the same.

Yes, per reaper().
2955 /*
2956 * Was it the statistics collector? If so, just try to start a new
2957 * one; no need to force reset of the rest of the system.
(If fail,
2958 * we'll try again in future cycles of the main loop.)
2959 */
2960 if (pid == PgStatPID)
2961 {
2962 PgStatPID = 0;
2963 if (!EXIT_STATUS_0(exitstatus))
2964 LogChildExit(LOG, _("statistics collector process"),
2965 pid, exitstatus);
2966 if (pmState == PM_RUN)
2967 PgStatPID = pgstat_start();
2968 continue;
2969 }

Regarding on_proc_reset(), stats collector is not attached to the shared memory and does not register on_proc_exit() callbacks. These situations are the same as the archiver process, so I followed it.

Right. I got confused because of on_shmem_exit() call in
pgstat_initialize. But that's per backend code and not executed within
stats collector.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

01_pgstat_avoid_writing_on_sigquit_v2.patchtext/x-patch; charset=US-ASCII; name=01_pgstat_avoid_writing_on_sigquit_v2.patchDownload+16-7
#14Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Ashutosh Bapat (#13)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

From: Ashutosh Bapat [mailto:ashutosh.bapat@enterprisedb.com]

Ok. In that case, I think we shouldn't even call PG_SETMASK() similar to
pgarch_exit(). Attached patch removes PG_SETMASK(). Let me know if it looks
good.

It looks good. Thanks.

Regards
Takayuki Tsunakawa

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Tsunakawa, Takayuki (#6)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

On Tue, Oct 4, 2016 at 2:35 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane

FWIW, I'm pretty much -1 on messing with the timing of the socket close

actions. I broke that once within recent memory, so maybe I'm gun-shy,
but I think that the odds of unpleasant side effects greatly outweigh any
likely benefit there.

I couldn't find any relevant mails in pgsql-hackers. I found no problem with the attached patch. Do you think this is OK?

Few comments on this patch,

I am not sure if following condition is a good idea in ServerLoop()
1650 if (pmState == PM_WAIT_DEAD_END || ClosedSockets)

There are no sockets to listen on, so select in the else condition is
going to sleep for timeout determined based on the sequence expected.
Just before we close sockets in pmdie() it sets AbortStartTime, which
determines the timeout for the sleep here. So, it doesn't make sense
to ignore it. Instead may be we should change the default 60s sleep to
100ms sleep in DetermineSleepTime().

If not, we need to at least update the comments to indicate the other
reason for not polling using select().
* If we are in PM_WAIT_DEAD_END state, then we don't want to accept
* any new connections, so we don't call select(), and just sleep.
*/
memcpy((char *) &rmask, (char *) &readmask, sizeof(fd_set));

-       if (pmState == PM_WAIT_DEAD_END)
+       if (pmState == PM_WAIT_DEAD_END || ClosedSockets)

While the postmaster is terminating children, a new connection request
may arrive. We should probably close listening sockets before
terminating children in pmdie().

Otherwise this patch looks good to me. It applies and compiles
cleanly. make check-world doesn't show any failures.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Ashutosh Bapat (#15)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ashutosh Bapat
I am not sure if following condition is a good idea in ServerLoop()
1650 if (pmState == PM_WAIT_DEAD_END || ClosedSockets)

There are no sockets to listen on, so select in the else condition is going
to sleep for timeout determined based on the sequence expected.
Just before we close sockets in pmdie() it sets AbortStartTime, which
determines the timeout for the sleep here. So, it doesn't make sense to
ignore it. Instead may be we should change the default 60s sleep to 100ms
sleep in DetermineSleepTime().

That sounds better. I modified cleaned ServerLoop() by pushing the existing 100ms logic into DetermineSleepTime().

While the postmaster is terminating children, a new connection request may
arrive. We should probably close listening sockets before terminating
children in pmdie().

I moved ClosePostmasterSocket() call before terminating children in the immediate shutdown case. I didn't change the behavior of smart and fast shutdown modes, because they may take a long time to complete due to checkpointing etc. Users will want to know what's happening during shutdown or after pg_ctl stop times out, by getting the message "FATAL: the database system is shutting down" when they try to connect to the database. The immediate shutdown or crash should better be as fast as possible.

Otherwise this patch looks good to me. It applies and compiles cleanly.
make check-world doesn't show any failures.

Thank you for reviewing and testing. The revised patch is attached.

Regards
Takayuki Tsunakawa

Attachments:

02_close_listen_ports_early_v2.patchapplication/octet-stream; name=02_close_listen_ports_early_v2.patchDownload+57-27
#17Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Tsunakawa, Takayuki (#16)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

On Mon, Nov 7, 2016 at 10:14 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ashutosh Bapat
I am not sure if following condition is a good idea in ServerLoop()
1650 if (pmState == PM_WAIT_DEAD_END || ClosedSockets)

There are no sockets to listen on, so select in the else condition is going
to sleep for timeout determined based on the sequence expected.
Just before we close sockets in pmdie() it sets AbortStartTime, which
determines the timeout for the sleep here. So, it doesn't make sense to
ignore it. Instead may be we should change the default 60s sleep to 100ms
sleep in DetermineSleepTime().

That sounds better. I modified cleaned ServerLoop() by pushing the existing 100ms logic into DetermineSleepTime().

I have changed some comments around this block. See attached patch.
Let me know if that looks good.

While the postmaster is terminating children, a new connection request may
arrive. We should probably close listening sockets before terminating
children in pmdie().

I moved ClosePostmasterSocket() call before terminating children in the immediate shutdown case. I didn't change the behavior of smart and fast shutdown modes, because they may take a long time to complete due to checkpointing etc. Users will want to know what's happening during shutdown or after pg_ctl stop times out, by getting the message "FATAL: the database system is shutting down" when they try to connect to the database. The immediate shutdown or crash should better be as fast as possible.

OK.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

02_close_listen_ports_early_v3.patchtext/x-patch; charset=US-ASCII; name=02_close_listen_ports_early_v3.patchDownload+59-27
#18Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Ashutosh Bapat (#17)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ashutosh Bapat
I have changed some comments around this block. See attached patch.
Let me know if that looks good.

Thanks, it looks good.

Regards
Takayuki Tsunakawa

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Tsunakawa, Takayuki (#18)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

On Mon, Nov 14, 2016 at 5:16 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ashutosh Bapat
I have changed some comments around this block. See attached patch.
Let me know if that looks good.

Thanks, it looks good.

Thanks. The patch 02_close_listen... closes the listen sockets
explicitly when it's known that postmaster is going to stop all the
children and then die. I have tried to see, if there's a possibility
that it closes the listen sockets but do not actually die, thus
causing a server which doesn't accept any connections and doesn't die.
But I have not found that possibility.

I do not have any further comments about both the patches. None else
has volunteered to review the patch till now. So, marking the entry as
ready for committer.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Peter Eisentraut
peter_e@gmx.net
In reply to: Tsunakawa, Takayuki (#5)
Re: [RFC] Should we fix postmaster to avoid slow shutdown?

On 9/27/16 11:07 PM, Tsunakawa, Takayuki wrote:

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
Allowing SIGQUIT to prompt fast shutdown of the stats collector seems sane,
though. Try to make sure it doesn't leave partly-written stats files
behind.

The attached patch based on HEAD does this. I'd like this to be back-patched because one of our important customers uses 9.2.

I didn't remove partially written stat files on SIGQUIT for the following reasons. Is this OK?

1. The recovery at the next restart will remove the stat files.
2. SIGQUIT processing should be as fast as possible.
3. If writing stats files took long due to the OS page cache flushing, removing files might be forced to wait likewise.

ISTM that this would change the "immediate shutdown" to not save stats
files anymore. So far, all the shutdown modes are equivalent in terms
of how they preserve data and system state. They differ only in when
the hard work happens. This would be a deviation from that principle.

Child processes don't distinguish between a SIGQUIT coming from a
user-initiated immediate shutdown request and a crash-induced
kill-everyone directive. So there might not be a non-ugly way to
address that.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#20)
#22Peter Eisentraut
peter_e@gmx.net
In reply to: Ashutosh Bapat (#19)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#21)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#22)
#25Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#23)
#26Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#24)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#26)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#27)
#29Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#28)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#29)
#31Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#28)
#32Robert Haas
robertmhaas@gmail.com
In reply to: Tsunakawa, Takayuki (#31)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#32)
#34Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#32)
#35Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Tom Lane (#33)
#36Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Tsunakawa, Takayuki (#35)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tsunakawa, Takayuki (#35)
#38Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#37)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#38)
#40Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#39)
#41Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#40)
#42Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#41)
#43Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#42)
#44Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#43)
#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#42)
#46Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#44)
#47Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#46)
#48Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#45)
#49Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#43)
#50Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#49)
#51Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#49)
#52Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#51)
#53Magnus Hagander
magnus@hagander.net
In reply to: Alvaro Herrera (#52)
#54Michael Paquier
michael@paquier.xyz
In reply to: Magnus Hagander (#53)
#55Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#54)
#56Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#55)
#57Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Tom Lane (#45)
#58Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Tom Lane (#49)
#59Robert Haas
robertmhaas@gmail.com
In reply to: Tsunakawa, Takayuki (#57)
#60Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Tsunakawa, Takayuki (#58)