rare avl shutdown slowness (related to signal handling)

Started by Qingqing Zhoualmost 11 years ago9 messages
#1Qingqing Zhou
zhouqq.postgres@gmail.com

I am playing git tip on windows 7/32 bits, with the backend compiled
with visual studio 2005 (I know, it is very old :-( ).

I encountered avl shutdown slowness twice, last night and this
morning: after a ctrl_c is hit, wait for another 60 seconds, server
shuts down. Here is one log:

D:\pgsql\bin>postgres -D../data --log_line_prefix="%t %p"
2015-04-07 10:30:04 PDT 3148LOG: database system was shut down at
2015-04-07 10:29:24 PDT
2015-04-07 10:30:04 PDT 19548LOG: database system is ready to accept
connections
2015-04-07 10:30:04 PDT 27008LOG: autovacuum launcher started
2015-04-07 10:30:08 PDT 19548LOG: received fast shutdown request
2015-04-07 10:30:08 PDT 19548LOG: aborting any active transactions
2015-04-07 10:30:08 PDT 27008LOG: autovacuum launcher shutting down
2015-04-07 10:30:08 PDT 27008ERROR: canceling statement due to user request
2015-04-07 10:31:09 PDT 27008LOG: autovacuum launcher shutting down
2015-04-07 10:31:09 PDT 15656LOG: shutting down
2015-04-07 10:31:09 PDT 15656LOG: database system is shut down

The interesting part is on PID 27008: avl first ereport() shutdown,
which is at the very end of the main loop and just one step away from
proc_exit(). Then it seems honors a SIGINT within ereport(), longjmp
to the loop head, and waits for another 60 seconds. After timeout, it
ereports shutdown again, and finally exits.

Not sure if this is windows only or general. I can hardly repro it.
Anyone has ever seen this?

Regards,
Qingqing

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Qingqing Zhou
zhouqq.postgres@gmail.com
In reply to: Qingqing Zhou (#1)
1 attachment(s)
Re: rare avl shutdown slowness (related to signal handling)

I got another repro with the shutdown slowness (DEBUG5 with verbosed
log are attached).

It gives a finer picture of what's going on:
1. Avl ereport("autovacuum launcher shutting down");
2. At the end of errfinish(), it honors a pending SIGINT;
3. SIGINT handler longjmp to the start of avl error handling;
4. The error handling continues and rebuild_database_list() (that's
why we see begin/commit pair);
5. In main loop, it WaitLatch(60 seconds);
6. Finally it ereport() again and proc_exit().

This looks like a general pattern - don't think *nix is immune. Notice
that this ereport() is special as there is way to go back. So we can
insert HOLD_INTERRUPTS() just before it.

Thoughts?

Regards,
Qingqing

On Tue, Apr 7, 2015 at 10:54 AM, Qingqing Zhou
<zhouqq.postgres@gmail.com> wrote:

Show quoted text

I am playing git tip on windows 7/32 bits, with the backend compiled
with visual studio 2005 (I know, it is very old :-( ).

I encountered avl shutdown slowness twice, last night and this
morning: after a ctrl_c is hit, wait for another 60 seconds, server
shuts down. Here is one log:

D:\pgsql\bin>postgres -D../data --log_line_prefix="%t %p"
2015-04-07 10:30:04 PDT 3148LOG: database system was shut down at
2015-04-07 10:29:24 PDT
2015-04-07 10:30:04 PDT 19548LOG: database system is ready to accept
connections
2015-04-07 10:30:04 PDT 27008LOG: autovacuum launcher started
2015-04-07 10:30:08 PDT 19548LOG: received fast shutdown request
2015-04-07 10:30:08 PDT 19548LOG: aborting any active transactions
2015-04-07 10:30:08 PDT 27008LOG: autovacuum launcher shutting down
2015-04-07 10:30:08 PDT 27008ERROR: canceling statement due to user request
2015-04-07 10:31:09 PDT 27008LOG: autovacuum launcher shutting down
2015-04-07 10:31:09 PDT 15656LOG: shutting down
2015-04-07 10:31:09 PDT 15656LOG: database system is shut down

The interesting part is on PID 27008: avl first ereport() shutdown,
which is at the very end of the main loop and just one step away from
proc_exit(). Then it seems honors a SIGINT within ereport(), longjmp
to the loop head, and waits for another 60 seconds. After timeout, it
ereports shutdown again, and finally exits.

Not sure if this is windows only or general. I can hardly repro it.
Anyone has ever seen this?

Regards,
Qingqing

Attachments:

avlexit.logapplication/octet-stream; name=avlexit.logDownload
#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Qingqing Zhou (#2)
Re: rare avl shutdown slowness (related to signal handling)

Qingqing Zhou <zhouqq.postgres@gmail.com> writes:

I got another repro with the shutdown slowness (DEBUG5 with verbosed
log are attached).

It gives a finer picture of what's going on:
1. Avl ereport("autovacuum launcher shutting down");
2. At the end of errfinish(), it honors a pending SIGINT;
3. SIGINT handler longjmp to the start of avl error handling;
4. The error handling continues and rebuild_database_list() (that's
why we see begin/commit pair);
5. In main loop, it WaitLatch(60 seconds);
6. Finally it ereport() again and proc_exit().

This looks like a general pattern - don't think *nix is immune. Notice
that this ereport() is special as there is way to go back. So we can
insert HOLD_INTERRUPTS() just before it.

Thoughts?

That seems like (a) a hack, and (b) not likely to solve the problem
completely, unless you leave interrupts held throughout proc_exit(),
which would create all sorts of opportunities for corner case bugs
during on_proc_exit hooks.

I think changing the outer "for(;;)" to "while (!got_SIGTERM)" would
be a much safer fix.

It looks like there's a related risk associated with this bit:

/* in emergency mode, just start a worker and go away */
if (!AutoVacuumingActive())
{
do_start_worker();
proc_exit(0); /* done */
}

If we get SIGHUP and see that autovacuum has been turned off,
we exit the main loop, but we don't set got_SIGTERM. So if we
then get a similar error at the shutdown report, we'd not merely
waste some time, but actually incorrectly launch a child.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#3)
1 attachment(s)
Re: rare avl shutdown slowness (related to signal handling)

Tom Lane wrote:

That seems like (a) a hack, and (b) not likely to solve the problem
completely, unless you leave interrupts held throughout proc_exit(),
which would create all sorts of opportunities for corner case bugs
during on_proc_exit hooks.

I think changing the outer "for(;;)" to "while (!got_SIGTERM)" would
be a much safer fix.

Ah, yeah. I was thinking in changing PG_exception_stack once shutdown
was requested, but this is much simpler.

It looks like there's a related risk associated with this bit:

/* in emergency mode, just start a worker and go away */
if (!AutoVacuumingActive())
{
do_start_worker();
proc_exit(0); /* done */
}

If we get SIGHUP and see that autovacuum has been turned off,
we exit the main loop, but we don't set got_SIGTERM. So if we
then get a similar error at the shutdown report, we'd not merely
waste some time, but actually incorrectly launch a child.

Well, what it does is request a new child from postmaster, which
presumably wouldn't be listening; but yeah there's a lot of other work
done here also.

How about the attached?

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

autovac-shutdown.patchtext/x-diff; charset=us-asciiDownload
*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
***************
*** 481,486 **** AutoVacLauncherMain(int argc, char *argv[])
--- 481,493 ----
  		/* Report the error to the server log */
  		EmitErrorReport();
  
+ 		/* if in shutdown mode, no need for anything further */
+ 		if (got_SIGTERM)
+ 		{
+ 			RESUME_INTERRUPTS();
+ 			goto shutdown;
+ 		}
+ 
  		/* Abort the current transaction in order to recover */
  		AbortCurrentTransaction();
  
***************
*** 542,549 **** AutoVacLauncherMain(int argc, char *argv[])
  	SetConfigOption("default_transaction_isolation", "read committed",
  					PGC_SUSET, PGC_S_OVERRIDE);
  
! 	/* in emergency mode, just start a worker and go away */
! 	if (!AutoVacuumingActive())
  	{
  		do_start_worker();
  		proc_exit(0);			/* done */
--- 549,559 ----
  	SetConfigOption("default_transaction_isolation", "read committed",
  					PGC_SUSET, PGC_S_OVERRIDE);
  
! 	/*
! 	 * in emergency mode, just start a worker and go away -- unless
! 	 * shutdown was requested.
! 	 */
! 	if (!AutoVacuumingActive() && !got_SIGTERM)
  	{
  		do_start_worker();
  		proc_exit(0);			/* done */
***************
*** 560,566 **** AutoVacLauncherMain(int argc, char *argv[])
  	 */
  	rebuild_database_list(InvalidOid);
  
! 	for (;;)
  	{
  		struct timeval nap;
  		TimestampTz current_time = 0;
--- 570,576 ----
  	 */
  	rebuild_database_list(InvalidOid);
  
! 	while (!got_SIGTERM)
  	{
  		struct timeval nap;
  		TimestampTz current_time = 0;
***************
*** 758,763 **** AutoVacLauncherMain(int argc, char *argv[])
--- 768,774 ----
  	}
  
  	/* Normal exit from the autovac launcher is here */
+ shutdown:
  	ereport(LOG,
  			(errmsg("autovacuum launcher shutting down")));
  	AutoVacuumShmem->av_launcherpid = 0;
#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#4)
Re: rare avl shutdown slowness (related to signal handling)

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Tom Lane wrote:

I think changing the outer "for(;;)" to "while (!got_SIGTERM)" would
be a much safer fix.

Ah, yeah. I was thinking in changing PG_exception_stack once shutdown
was requested, but this is much simpler.

Your proposed patch seems to be doing both of those, which is probably
unnecessary. I don't object to the SIGHUP test and goto in the error
path, but I'd put it a lot further down, like after the existing
RESUME_INTERRUPTS. I doubt it's a good idea to skip the transaction
cleanup steps.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Qingqing Zhou
zhouqq.postgres@gmail.com
In reply to: Tom Lane (#3)
Re: rare avl shutdown slowness (related to signal handling)

On Tue, Apr 7, 2015 at 2:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

That seems like (a) a hack, and (b) not likely to solve the problem
completely, unless you leave interrupts held throughout proc_exit(),
which would create all sorts of opportunities for corner case bugs
during on_proc_exit hooks.

Hmm, looks like proc_exit() already taken care of this by setting
proc_exit_inprogress and StatementCancelHandler() respects it.
Actually, in quickdie(), I found a similar practice for the same
reason:

/*
* Prevent interrupts while exiting; though we just blocked signals that
* would queue new interrupts, one may have been pending. We don't want a
* quickdie() downgraded to a mere query cancel.
*/
HOLD_INTERRUPTS();
I do feel that we have too many functions instructing how to handle
interrupts and they are subtle - I just found a new friend
HOLD_CANCEL_INTERRUPTS :-(

Regards,
Qingqing

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Qingqing Zhou (#6)
Re: rare avl shutdown slowness (related to signal handling)

Qingqing Zhou <zhouqq.postgres@gmail.com> writes:

I do feel that we have too many functions instructing how to handle
interrupts and they are subtle - I just found a new friend
HOLD_CANCEL_INTERRUPTS :-(

Indeed, which is why I think a patch for this issue should not introduce
a new mode/context in which proc_exit() is executed. The code path would
be so seldom executed that we'd never manage to isolate any bugs that
might be in it. Thus also my objection to Alvaro's patch that had us
going to proc_exit from partway through the error cleanup sequence.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Qingqing Zhou
zhouqq.postgres@gmail.com
In reply to: Tom Lane (#7)
Re: rare avl shutdown slowness (related to signal handling)

On Tue, Apr 7, 2015 at 4:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Indeed, which is why I think a patch for this issue should not introduce
a new mode/context in which proc_exit() is executed.

Agree. Along this line, we can add an on_proc_exit hook simply
ereport("we are exiting") there. In this way, we reuse all shields
proc_exit() already has. Abuse?

Regards,
Qingqing

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#5)
1 attachment(s)
Re: rare avl shutdown slowness (related to signal handling)

Tom Lane wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Tom Lane wrote:

I think changing the outer "for(;;)" to "while (!got_SIGTERM)" would
be a much safer fix.

Ah, yeah. I was thinking in changing PG_exception_stack once shutdown
was requested, but this is much simpler.

Your proposed patch seems to be doing both of those, which is probably
unnecessary. I don't object to the SIGHUP test and goto in the error
path, but I'd put it a lot further down, like after the existing
RESUME_INTERRUPTS. I doubt it's a good idea to skip the transaction
cleanup steps.

Ah, of course. See attached patch.

We could keep the "for (;;)" instead of having it conditional on
got_SIGTERM, but I don't think there's any harm in the change.

This is backpatchable to 9.0.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

autovac-shutdown-2.patchtext/x-diff; charset=us-asciiDownload
*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
***************
*** 507,512 **** AutoVacLauncherMain(int argc, char *argv[])
--- 507,516 ----
  		/* Now we can allow interrupts again */
  		RESUME_INTERRUPTS();
  
+ 		/* if in shutdown mode, no need for anything further; just go away */
+ 		if (got_SIGTERM)
+ 			goto shutdown;
+ 
  		/*
  		 * Sleep at least 1 second after any error.  We don't want to be
  		 * filling the error logs as fast as we can.
***************
*** 542,551 **** AutoVacLauncherMain(int argc, char *argv[])
  	SetConfigOption("default_transaction_isolation", "read committed",
  					PGC_SUSET, PGC_S_OVERRIDE);
  
! 	/* in emergency mode, just start a worker and go away */
  	if (!AutoVacuumingActive())
  	{
! 		do_start_worker();
  		proc_exit(0);			/* done */
  	}
  
--- 546,559 ----
  	SetConfigOption("default_transaction_isolation", "read committed",
  					PGC_SUSET, PGC_S_OVERRIDE);
  
! 	/*
! 	 * In emergency mode, just start a worker (unless shutdown was requested)
! 	 * and go away.
! 	 */
  	if (!AutoVacuumingActive())
  	{
! 		if (!got_SIGTERM)
! 			do_start_worker();
  		proc_exit(0);			/* done */
  	}
  
***************
*** 560,566 **** AutoVacLauncherMain(int argc, char *argv[])
  	 */
  	rebuild_database_list(InvalidOid);
  
! 	for (;;)
  	{
  		struct timeval nap;
  		TimestampTz current_time = 0;
--- 568,575 ----
  	 */
  	rebuild_database_list(InvalidOid);
  
! 	/* loop until shutdown request */
! 	while (!got_SIGTERM)
  	{
  		struct timeval nap;
  		TimestampTz current_time = 0;
***************
*** 758,763 **** AutoVacLauncherMain(int argc, char *argv[])
--- 767,773 ----
  	}
  
  	/* Normal exit from the autovac launcher is here */
+ shutdown:
  	ereport(LOG,
  			(errmsg("autovacuum launcher shutting down")));
  	AutoVacuumShmem->av_launcherpid = 0;