Resetting crash time of background worker

Started by Amit Khandekaralmost 11 years ago4 messages
#1Amit Khandekar
amitdkhan.pg@gmail.com

When the postmaster recovers from a backend or worker crash, it resets bg
worker's crash time (rw->rw_crashed_at) so that the bgworker will
immediately restart (ResetBackgroundWorkerCrashTimes).

But resetting rw->rw_crashed_at to 0 means that we have lost the
information that the bgworker had actuallly crashed. So later when
postmaster tries to find any workers that should start
(maybe_start_bgworker), it treats this worker as a new worker, as against
treating it as one that had crashed and is to be restarted. So for this
bgworker, it does not consider BGW_NEVER_RESTART :

if (rw->rw_crashed_at != 0) { if (rw->rw_worker.bgw_restart_time ==
BGW_NEVER_RESTART) { ForgetBackgroundWorker(&iter); continue; } .... ....
That means, it will not remove the worker, and it will be restarted. Now if
the worker again crashes, postmaster would keep on repeating the crash and
restart cycle for the whole system.

From what I understand, BGW_NEVER_RESTART applies even to a crashed server.
But let me know if I am missing anything.

I think we either have to retain the knowledge that the worker has crashed
using some new field, or else, we should reset the crash time only if it is
not flagged BGW_NEVER_RESTART.

-Amit Khandekar

#2Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#1)
Re: Resetting crash time of background worker

On Tue, Mar 17, 2015 at 1:33 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

When the postmaster recovers from a backend or worker crash, it resets bg
worker's crash time (rw->rw_crashed_at) so that the bgworker will
immediately restart (ResetBackgroundWorkerCrashTimes).

But resetting rw->rw_crashed_at to 0 means that we have lost the information
that the bgworker had actuallly crashed. So later when postmaster tries to
find any workers that should start (maybe_start_bgworker), it treats this
worker as a new worker, as against treating it as one that had crashed and
is to be restarted. So for this bgworker, it does not consider
BGW_NEVER_RESTART :

if (rw->rw_crashed_at != 0) { if (rw->rw_worker.bgw_restart_time ==
BGW_NEVER_RESTART) { ForgetBackgroundWorker(&iter); continue; } .... ....
That means, it will not remove the worker, and it will be restarted. Now if
the worker again crashes, postmaster would keep on repeating the crash and
restart cycle for the whole system.

From what I understand, BGW_NEVER_RESTART applies even to a crashed server.
But let me know if I am missing anything.

I think we either have to retain the knowledge that the worker has crashed
using some new field, or else, we should reset the crash time only if it is
not flagged BGW_NEVER_RESTART.

I think you're right, and I think we should do the second of those.
Thanks for tracking this down.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#2)
1 attachment(s)
Re: Resetting crash time of background worker

On 17 March 2015 at 19:12, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 17, 2015 at 1:33 AM, Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:

I think we either have to retain the knowledge that the worker has

crashed

using some new field, or else, we should reset the crash time only if it

is

not flagged BGW_NEVER_RESTART.

I think you're right, and I think we should do the second of those.
Thanks for tracking this down.

Thanks. Attached a patch accordingly. Put this into the June 2015
commitfest.

Attachments:

reset_crashtimes_fix.patchtext/x-patch; charset=US-ASCII; name=reset_crashtimes_fix.patchDownload
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 85a3b3a..1536691 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -397,9 +397,9 @@ BackgroundWorkerStopNotifications(pid_t pid)
 /*
  * Reset background worker crash state.
  *
- * We assume that, after a crash-and-restart cycle, background workers should
- * be restarted immediately, instead of waiting for bgw_restart_time to
- * elapse.
+ * We assume that, after a crash-and-restart cycle, background workers without
+ * the never-restart flag should be restarted immediately, instead of waiting
+ * for bgw_restart_time to elapse.
  */
 void
 ResetBackgroundWorkerCrashTimes(void)
@@ -411,7 +411,14 @@ ResetBackgroundWorkerCrashTimes(void)
 		RegisteredBgWorker *rw;
 
 		rw = slist_container(RegisteredBgWorker, rw_lnode, iter.cur);
-		rw->rw_crashed_at = 0;
+
+		/*
+		 * For workers that should not be restarted, we don't want to loose
+		 * the information that they have crashed, otherwise they would be
+		 * treated as new workers.
+		 */
+		if (rw->rw_worker.bgw_restart_time != BGW_NEVER_RESTART)
+			rw->rw_crashed_at = 0;
 	}
 }
 
#4Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#3)
Re: Resetting crash time of background worker

On Sun, Mar 22, 2015 at 10:55 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 17 March 2015 at 19:12, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 17, 2015 at 1:33 AM, Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:

I think we either have to retain the knowledge that the worker has
crashed
using some new field, or else, we should reset the crash time only if it
is
not flagged BGW_NEVER_RESTART.

I think you're right, and I think we should do the second of those.
Thanks for tracking this down.

Thanks. Attached a patch accordingly. Put this into the June 2015
commitfest.

Committed and back-patched to 9.4.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers