[patch] demote

Started by Jehan-Guillaume de Rorthais over 5 years ago, 30 messages
1 attachment(s)

Hi,

As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with future
objectives similar to mine, I decided to share the humble patch I am playing with to
step down an instance from primary to standby status.

I'm still wondering about the coding style, but as discussion about this
kind of feature is picking up, I'm sharing it at an early stage so it has a
chance to be discussed.

I'm opening a new discussion to avoid disturbing Amul's.

The design of my patch is similar to the crash recovery code, without resetting
the shared memory. It supports smart and fast demote. The only user
interface for now is "pg_ctl [-m smart|fast] demote". An SQL admin function,
e.g. pg_demote(), would be easy to add.

The main difference from Amul's patch is that all backends must be disconnected to
proceed with the demote. Either we wait for them to disconnect (smart) or we
kill them (fast). This makes life much easier from the code point of view, but
much more naive as well. E.g., calling "SELECT pg_demote('fast')" as an admin
would kill the session, with no option to wait for the action to finish, as we
do with pg_promote(). Keeping read-only sessions around could probably be
achieved using a global barrier as Amul did, but without all the complexity
related to prohibiting WAL writes.
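
To illustrate the smart/fast distinction, here is a minimal, standalone C sketch of how a received signal and a pending demote request could map to the next step-down state. The enum mirrors the StepDownState enum from the attached patch; the helper next_step_down() is hypothetical and only models the decision, it is not code from the patch.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Mirrors the StepDownState enum introduced by the patch. */
typedef enum StepDownState
{
	NoShutdown = 0,
	SmartShutdown,
	SmartDemote,
	FastShutdown,
	FastDemote,
	ImmediateShutdown
} StepDownState;

/*
 * Hypothetical helper (not in the patch): given the current state, the
 * signal received by the postmaster and whether a demote request is
 * pending, return the next step-down state.
 */
static StepDownState
next_step_down(StepDownState current, int signo, bool demote_requested)
{
	assert(current >= NoShutdown && current <= ImmediateShutdown);

	switch (signo)
	{
		case SIGTERM:			/* smart: wait for sessions to leave */
			if (current >= SmartShutdown)
				return current;
			return demote_requested ? SmartDemote : SmartShutdown;
		case SIGINT:			/* fast: sessions are terminated */
			if (current >= FastShutdown)
				return current;
			return demote_requested ? FastDemote : FastShutdown;
		case SIGQUIT:			/* immediate: a demote is not possible */
			return ImmediateShutdown;
		default:
			return current;
	}
}
```

For example, next_step_down(SmartDemote, SIGINT, true) yields FastDemote: a fast request can still escalate a smart demote that is waiting for sessions to disconnect.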

There are still some open questions in the current patch. As I wrote, it's a
humble patch, a proof of concept, a bit naive.

Is it worth discussing and improving further, or am I missing something
obvious in this design that leads to a dead end?

Thanks.

Regards,

[1]: /messages/by-id/CAAJ_b97KZzdJsffwRK7w0XU5HnXkcgKgTR69t8cOZztsyXjkQw@mail.gmail.com

Attachments:

v1-0001-Demote-PoC.patch (text/x-patch)
From 2075441bebc47d3dd5b6e0a76e16f5ebb12858af Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Fri, 10 Apr 2020 18:01:45 +0200
Subject: [PATCH] Demote PoC

---
 src/backend/access/transam/xlog.c       |   3 +-
 src/backend/postmaster/postmaster.c     | 206 ++++++++++++++++++------
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/bin/pg_ctl/pg_ctl.c                 | 105 ++++++++++++
 src/include/catalog/pg_control.h        |   1 +
 src/include/libpq/libpq-be.h            |   7 +-
 src/include/utils/pidfile.h             |   1 +
 7 files changed, 271 insertions(+), 54 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 55cac186dc..8a7f1a0855 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8493,6 +8493,7 @@ ShutdownXLOG(int code, Datum arg)
 	CurrentResourceOwner = AuxProcessResourceOwner;
 
 	/* Don't be chatty in standalone mode */
+	// FIXME: what message when demoting?
 	ereport(IsPostmasterEnvironment ? LOG : NOTICE,
 			(errmsg("shutting down")));
 
@@ -8760,7 +8761,7 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-		ControlFile->state = DB_SHUTDOWNING;
+		ControlFile->state = DB_SHUTDOWNING; // DEMOTING?
 		ControlFile->time = (pg_time_t) time(NULL);
 		UpdateControlFile();
 		LWLockRelease(ControlFileLock);
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b4d475bb0b..465d020f9d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -150,6 +150,9 @@
 
 #define BACKEND_TYPE_WORKER		(BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
+/* file to signal demotion from primary to standby */
+#define DEMOTE_SIGNAL_FILE		"demote"
+
 /*
  * List of active backends (or child processes anyway; we don't actually
  * know whether a given child has become a backend or is still in the
@@ -269,18 +272,23 @@ typedef enum
 static StartupStatusEnum StartupStatus = STARTUP_NOT_RUNNING;
 
 /* Startup/shutdown state */
-#define			NoShutdown		0
-#define			SmartShutdown	1
-#define			FastShutdown	2
-#define			ImmediateShutdown	3
-
-static int	Shutdown = NoShutdown;
+typedef enum StepDownState {
+	NoShutdown = 0, /* find better label? */
+	SmartShutdown,
+	SmartDemote,
+	FastShutdown,
+	FastDemote,
+	ImmediateShutdown
+} StepDownState;
+
+static StepDownState StepDown = NoShutdown;
+static bool DemoteSignal = false; /* true on demote request */
 
 static bool FatalError = false; /* T if recovering from backend crash */
 
 /*
- * We use a simple state machine to control startup, shutdown, and
- * crash recovery (which is rather like shutdown followed by startup).
+ * We use a simple state machine to control startup, shutdown, demote and
+ * crash recovery (both are rather like shutdown followed by startup).
  *
  * After doing all the postmaster initialization work, we enter PM_STARTUP
  * state and the startup process is launched. The startup process begins by
@@ -314,7 +322,7 @@ static bool FatalError = false; /* T if recovering from backend crash */
  * will not be very long).
  *
  * Notice that this state variable does not distinguish *why* we entered
- * states later than PM_RUN --- Shutdown and FatalError must be consulted
+ * states later than PM_RUN --- StepDown and FatalError must be consulted
  * to find that out.  FatalError is never true in PM_RECOVERY_* or PM_RUN
  * states, nor in PM_SHUTDOWN states (because we don't enter those states
  * when trying to recover from a crash).  It can be true in PM_STARTUP state,
@@ -414,6 +422,8 @@ static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
 static bool SignalSomeChildren(int signal, int targets);
 static void TerminateChildren(int signal);
+static bool CheckDemoteSignal(void);
+
 
 #define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
 
@@ -1550,7 +1560,7 @@ DetermineSleepTime(struct timeval *timeout)
 	 * Normal case: either there are no background workers at all, or we're in
 	 * a shutdown sequence (during which we ignore bgworkers altogether).
 	 */
-	if (Shutdown > NoShutdown ||
+	if (StepDown > NoShutdown ||
 		(!StartWorkerNeeded && !HaveCrashedWorker))
 	{
 		if (AbortStartTime != 0)
@@ -1830,7 +1840,7 @@ ServerLoop(void)
 		 *
 		 * Note we also do this during recovery from a process crash.
 		 */
-		if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
+		if ((StepDown >= ImmediateShutdown || (FatalError && !SendStop)) &&
 			AbortStartTime != 0 &&
 			(now - AbortStartTime) >= SIGKILL_CHILDREN_AFTER_SECS)
 		{
@@ -2305,6 +2315,11 @@ retry1:
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
 					 errmsg("the database system is starting up")));
 			break;
+		case CAC_DEMOTE:
+			ereport(FATAL,
+					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+					 errmsg("the database system is demoting")));
+			break;
 		case CAC_SHUTDOWN:
 			ereport(FATAL,
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
@@ -2436,7 +2451,7 @@ canAcceptConnections(int backend_type)
 	CAC_state	result = CAC_OK;
 
 	/*
-	 * Can't start backends when in startup/shutdown/inconsistent recovery
+	 * Can't start backends when in startup/demote/shutdown/inconsistent recovery
 	 * state.  We treat autovac workers the same as user backends for this
 	 * purpose.  However, bgworkers are excluded from this test; we expect
 	 * bgworker_should_start_now() decided whether the DB state allows them.
@@ -2452,7 +2467,9 @@ canAcceptConnections(int backend_type)
 	{
 		if (pmState == PM_WAIT_BACKUP)
 			result = CAC_WAITBACKUP;	/* allow superusers only */
-		else if (Shutdown > NoShutdown)
+		else if (StepDown == SmartDemote || StepDown == FastDemote)
+			return CAC_DEMOTE;	/* demote is pending */
+		else if (StepDown > NoShutdown)
 			return CAC_SHUTDOWN;	/* shutdown is pending */
 		else if (!FatalError &&
 				 (pmState == PM_STARTUP ||
@@ -2683,7 +2700,8 @@ SIGHUP_handler(SIGNAL_ARGS)
 	PG_SETMASK(&BlockSig);
 #endif
 
-	if (Shutdown <= SmartShutdown)
+	if (StepDown == NoShutdown || StepDown == SmartShutdown ||
+		StepDown == SmartDemote)
 	{
 		ereport(LOG,
 				(errmsg("received SIGHUP, reloading configuration files")));
@@ -2769,26 +2787,72 @@ pmdie(SIGNAL_ARGS)
 			(errmsg_internal("postmaster received signal %d",
 							 postgres_signal_arg)));
 
+	if (CheckDemoteSignal())
+	{
+		if (pmState != PM_RUN)
+		{
+			DemoteSignal = false;
+			unlink(DEMOTE_SIGNAL_FILE);
+			ereport(LOG,
+					(errmsg("ignoring demote signal because already in standby mode")));
+		}
+		else if (postgres_signal_arg == SIGQUIT) {
+			DemoteSignal = false;
+			ereport(WARNING,
+					(errmsg("can not demote in immediate stop mode")));
+			// FIXME: should we abort the shutdown process?
+		}
+		else
+		{
+			FILE	   *standby_file;
+
+			DemoteSignal = true;
+
+			/* create the standby signal file */
+			standby_file = AllocateFile(STANDBY_SIGNAL_FILE, "w");
+			if (!standby_file)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create file \"%s\": %m",
+								STANDBY_SIGNAL_FILE)));
+
+			if (FreeFile(standby_file))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write file \"%s\": %m",
+								STANDBY_SIGNAL_FILE)));
+		}
+
+		unlink(DEMOTE_SIGNAL_FILE);
+	}
+
 	switch (postgres_signal_arg)
 	{
 		case SIGTERM:
 
 			/*
-			 * Smart Shutdown:
+			 * Smart Stepdown:
 			 *
-			 * Wait for children to end their work, then shut down.
+			 * Wait for children to end their work, then shut down or demote.
 			 */
-			if (Shutdown >= SmartShutdown)
+			if (StepDown >= SmartShutdown)
 				break;
-			Shutdown = SmartShutdown;
-			ereport(LOG,
-					(errmsg("received smart shutdown request")));
 
-			/* Report status */
-			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
+			if (DemoteSignal) {
+				StepDown = SmartDemote;
+				ereport(LOG, (errmsg("received smart demote request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_DEMOTING);
+			}
+			else {
+				StepDown = SmartShutdown;
+				ereport(LOG, (errmsg("received smart shutdown request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
 #ifdef USE_SYSTEMD
-			sd_notify(0, "STOPPING=1");
+				sd_notify(0, "STOPPING=1");
 #endif
+			}
 
 			if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 				pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
@@ -2831,22 +2895,29 @@ pmdie(SIGNAL_ARGS)
 		case SIGINT:
 
 			/*
-			 * Fast Shutdown:
+			 * Fast StepDown:
 			 *
 			 * Abort all children with SIGTERM (rollback active transactions
-			 * and exit) and shut down when they are gone.
+			 * and exit) and shut down or demote when they are gone.
 			 */
-			if (Shutdown >= FastShutdown)
+			if (StepDown >= FastShutdown)
 				break;
-			Shutdown = FastShutdown;
-			ereport(LOG,
-					(errmsg("received fast shutdown request")));
 
-			/* Report status */
-			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
+			if (DemoteSignal) {
+				StepDown = FastDemote;
+				ereport(LOG, (errmsg("received fast demote request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_DEMOTING);
+			}
+			else {
+				StepDown = FastShutdown;
+				ereport(LOG, (errmsg("received fast shutdown request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
 #ifdef USE_SYSTEMD
-			sd_notify(0, "STOPPING=1");
+				sd_notify(0, "STOPPING=1");
 #endif
+			}
 
 			if (StartupPID != 0)
 				signal_child(StartupPID, SIGTERM);
@@ -2903,9 +2974,9 @@ pmdie(SIGNAL_ARGS)
 			 * terminate remaining ones with SIGKILL, then exit without
 			 * attempt to properly shut down the data base system.
 			 */
-			if (Shutdown >= ImmediateShutdown)
+			if (StepDown >= ImmediateShutdown)
 				break;
-			Shutdown = ImmediateShutdown;
+			StepDown = ImmediateShutdown;
 			ereport(LOG,
 					(errmsg("received immediate shutdown request")));
 
@@ -2967,10 +3038,11 @@ reaper(SIGNAL_ARGS)
 			StartupPID = 0;
 
 			/*
-			 * Startup process exited in response to a shutdown request (or it
-			 * completed normally regardless of the shutdown request).
+			 * Startup process exited in response to a shutdown or demote
+			 * request (or it completed normally regardless of the shutdown
+			 * request).
 			 */
-			if (Shutdown > NoShutdown &&
+			if (StepDown > NoShutdown &&
 				(EXIT_STATUS_0(exitstatus) || EXIT_STATUS_1(exitstatus)))
 			{
 				StartupStatus = STARTUP_NOT_RUNNING;
@@ -2984,7 +3056,7 @@ reaper(SIGNAL_ARGS)
 				ereport(LOG,
 						(errmsg("shutdown at recovery target")));
 				StartupStatus = STARTUP_NOT_RUNNING;
-				Shutdown = SmartShutdown;
+				StepDown = SmartShutdown;
 				TerminateChildren(SIGTERM);
 				pmState = PM_WAIT_BACKENDS;
 				/* PostmasterStateMachine logic does the rest */
@@ -3124,7 +3196,7 @@ reaper(SIGNAL_ARGS)
 				 * archive cycle and quit. Likewise, if we have walsender
 				 * processes, tell them to send any remaining WAL and quit.
 				 */
-				Assert(Shutdown > NoShutdown);
+				Assert(StepDown > NoShutdown);
 
 				/* Waken archiver for the last time */
 				if (PgArchPID != 0)
@@ -3484,7 +3556,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	 * signaled children, nonzero exit status is to be expected, so don't
 	 * clutter log.
 	 */
-	take_action = !FatalError && Shutdown != ImmediateShutdown;
+	take_action = !FatalError && StepDown != ImmediateShutdown;
 
 	if (take_action)
 	{
@@ -3702,7 +3774,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 
 	/* We do NOT restart the syslogger */
 
-	if (Shutdown != ImmediateShutdown)
+	if (StepDown != ImmediateShutdown)
 		FatalError = true;
 
 	/* We now transit into a state of waiting for children to die */
@@ -3845,11 +3917,11 @@ PostmasterStateMachine(void)
 			WalReceiverPID == 0 &&
 			BgWriterPID == 0 &&
 			(CheckpointerPID == 0 ||
-			 (!FatalError && Shutdown < ImmediateShutdown)) &&
+			 (!FatalError && StepDown < ImmediateShutdown)) &&
 			WalWriterPID == 0 &&
 			AutoVacPID == 0)
 		{
-			if (Shutdown >= ImmediateShutdown || FatalError)
+			if (StepDown >= ImmediateShutdown || FatalError)
 			{
 				/*
 				 * Start waiting for dead_end children to die.  This state
@@ -3870,7 +3942,7 @@ PostmasterStateMachine(void)
 				 * the regular children are gone, and it's time to tell the
 				 * checkpointer to do a shutdown checkpoint.
 				 */
-				Assert(Shutdown > NoShutdown);
+				Assert(StepDown > NoShutdown);
 				/* Start the checkpointer if not running */
 				if (CheckpointerPID == 0)
 					CheckpointerPID = StartCheckpointer();
@@ -3958,7 +4030,8 @@ PostmasterStateMachine(void)
 	 * EOF on its input pipe, which happens when there are no more upstream
 	 * processes.
 	 */
-	if (Shutdown > NoShutdown && pmState == PM_NO_CHILDREN)
+	if (pmState == PM_NO_CHILDREN && (StepDown == SmartShutdown ||
+		StepDown == FastShutdown || StepDown == ImmediateShutdown))
 	{
 		if (FatalError)
 		{
@@ -3991,15 +4064,29 @@ PostmasterStateMachine(void)
 	 * startup process fails, because more than likely it will just fail again
 	 * and we will keep trying forever.
 	 */
-	if (pmState == PM_NO_CHILDREN &&
+	if (pmState == PM_NO_CHILDREN && !DemoteSignal &&
 		(StartupStatus == STARTUP_CRASHED || !restart_after_crash))
 		ExitPostmaster(1);
 
+	/* Handle demote signal */
+	if (DemoteSignal && pmState == PM_NO_CHILDREN)
+	{
+		ereport(LOG, (errmsg("all server processes terminated; demoting")));
+
+		// Signal bgworkers?
+
+		StartupPID = StartupDataBase();
+		Assert(StartupPID != 0);
+		StartupStatus = STARTUP_RUNNING;
+		pmState = PM_STARTUP;
+		StepDown = NoShutdown;
+	}
+
 	/*
 	 * If we need to recover from a crash, wait for all non-syslogger children
 	 * to exit, then reset shmem and StartupDataBase.
 	 */
-	if (FatalError && pmState == PM_NO_CHILDREN)
+	else if (FatalError && pmState == PM_NO_CHILDREN)
 	{
 		ereport(LOG,
 				(errmsg("all server processes terminated; reinitializing")));
@@ -5195,7 +5282,7 @@ sigusr1_handler(SIGNAL_ARGS)
 	 * first. We don't want to go back to recovery in that case.
 	 */
 	if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED) &&
-		pmState == PM_STARTUP && Shutdown == NoShutdown)
+		pmState == PM_STARTUP && StepDown == NoShutdown)
 	{
 		/* WAL redo has started. We're out of reinitialization. */
 		FatalError = false;
@@ -5234,7 +5321,7 @@ sigusr1_handler(SIGNAL_ARGS)
 		pmState = PM_RECOVERY;
 	}
 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
-		pmState == PM_RECOVERY && Shutdown == NoShutdown)
+		pmState == PM_RECOVERY && StepDown == NoShutdown)
 	{
 		/*
 		 * Likewise, start other special children as needed.
@@ -5284,7 +5371,7 @@ sigusr1_handler(SIGNAL_ARGS)
 	}
 
 	if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER) &&
-		Shutdown == NoShutdown)
+		StepDown == NoShutdown)
 	{
 		/*
 		 * Start one iteration of the autovacuum daemon, even if autovacuuming
@@ -5299,7 +5386,7 @@ sigusr1_handler(SIGNAL_ARGS)
 	}
 
 	if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER) &&
-		Shutdown == NoShutdown)
+		StepDown == NoShutdown)
 	{
 		/* The autovacuum launcher wants us to start a worker process. */
 		StartAutovacuumWorker();
@@ -5644,7 +5731,7 @@ MaybeStartWalReceiver(void)
 	if (WalReceiverPID == 0 &&
 		(pmState == PM_STARTUP || pmState == PM_RECOVERY ||
 		 pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&
-		Shutdown == NoShutdown)
+		StepDown == NoShutdown)
 	{
 		WalReceiverPID = StartWalReceiver();
 		if (WalReceiverPID != 0)
@@ -6647,3 +6734,18 @@ InitPostmasterDeathWatchHandle(void)
 								 GetLastError())));
 #endif							/* WIN32 */
 }
+
+/*
+ * Check if a demote request appeared. Should be called by postmaster before
+ * shutting down.
+ */
+bool
+CheckDemoteSignal(void)
+{
+	struct stat stat_buf;
+
+	if (stat(DEMOTE_SIGNAL_FILE, &stat_buf) == 0)
+		return true;
+
+	return false;
+}
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df74..c144cc35d3 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -57,6 +57,8 @@ dbState(DBState state)
 			return _("shut down");
 		case DB_SHUTDOWNED_IN_RECOVERY:
 			return _("shut down in recovery");
+		case DB_DEMOTING:
+			return _("demoting");
 		case DB_SHUTDOWNING:
 			return _("shutting down");
 		case DB_IN_CRASH_RECOVERY:
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 3c03ace7ed..0bb7d69682 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -62,6 +62,7 @@ typedef enum
 	RESTART_COMMAND,
 	RELOAD_COMMAND,
 	STATUS_COMMAND,
+	DEMOTE_COMMAND,
 	PROMOTE_COMMAND,
 	LOGROTATE_COMMAND,
 	KILL_COMMAND,
@@ -103,6 +104,7 @@ static char version_file[MAXPGPATH];
 static char pid_file[MAXPGPATH];
 static char backup_file[MAXPGPATH];
 static char promote_file[MAXPGPATH];
+static char demote_file[MAXPGPATH];
 static char logrotate_file[MAXPGPATH];
 
 static volatile pgpid_t postmasterPID = -1;
@@ -129,6 +131,7 @@ static void do_stop(void);
 static void do_restart(void);
 static void do_reload(void);
 static void do_status(void);
+static void do_demote(void);
 static void do_promote(void);
 static void do_logrotate(void);
 static void do_kill(pgpid_t pid);
@@ -1029,6 +1032,103 @@ do_stop(void)
 }
 
 
+static void
+do_demote(void)
+{
+	int			cnt;
+	FILE	   *dmtfile;
+	pgpid_t		pid;
+	struct stat statbuf;
+
+	pid = get_pgpid(false);
+
+	if (pid == 0)				/* no pid file */
+	{
+		write_stderr(_("%s: PID file \"%s\" does not exist\n"), progname, pid_file);
+		write_stderr(_("Is server running?\n"));
+		exit(1);
+	}
+	else if (pid < 0)			/* standalone backend, not postmaster */
+	{
+		pid = -pid;
+		write_stderr(_("%s: cannot demote server; "
+					   "single-user server is running (PID: %ld)\n"),
+					 progname, pid);
+		exit(1);
+	}
+
+	snprintf(demote_file, MAXPGPATH, "%s/demote", pg_data);
+
+	if ((dmtfile = fopen(demote_file, "w")) == NULL)
+	{
+		write_stderr(_("%s: could not create demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+	if (fclose(dmtfile))
+	{
+		write_stderr(_("%s: could not write demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+
+	if (kill((pid_t) pid, sig) != 0)
+	{
+		write_stderr(_("%s: could not send stop signal (PID: %ld): %s\n"), progname, pid,
+					 strerror(errno));
+		exit(1);
+	}
+
+	if (!do_wait)
+	{
+		print_msg(_("server demoting\n"));
+		return;
+	}
+	else
+	{
+		/*
+		 * If backup_label exists, an online backup is running. Warn the user
+		 * that smart demote will wait for it to finish. However, if the
+		 * server is in archive recovery, we're recovering from an online
+		 * backup instead of performing one.
+		 */
+		if (shutdown_mode == SMART_MODE &&
+			stat(backup_file, &statbuf) == 0 &&
+			get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		{
+			print_msg(_("WARNING: online backup mode is active\n"
+						"Demote will not complete until pg_stop_backup() is called.\n\n"));
+		}
+
+		print_msg(_("waiting for server to demote..."));
+
+		for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
+		{
+			if (get_control_dbstate() == DB_IN_ARCHIVE_RECOVERY)
+				break;
+
+			if (cnt % WAITS_PER_SEC == 0)
+				print_msg(".");
+			pg_usleep(USEC_PER_SEC / WAITS_PER_SEC);
+		}
+
+		if (get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		{
+			print_msg(_(" failed\n"));
+
+			write_stderr(_("%s: server does not demote\n"), progname);
+			if (shutdown_mode == SMART_MODE)
+				write_stderr(_("HINT: The \"-m fast\" option immediately disconnects sessions rather than\n"
+							   "waiting for session-initiated disconnection.\n"));
+			exit(1);
+		}
+		print_msg(_(" done\n"));
+
+		print_msg(_("server demoted\n"));
+	}
+}
+
+
 /*
  *	restart/reload routines
  */
@@ -2452,6 +2552,8 @@ main(int argc, char **argv)
 				ctl_command = RELOAD_COMMAND;
 			else if (strcmp(argv[optind], "status") == 0)
 				ctl_command = STATUS_COMMAND;
+			else if (strcmp(argv[optind], "demote") == 0)
+				ctl_command = DEMOTE_COMMAND;
 			else if (strcmp(argv[optind], "promote") == 0)
 				ctl_command = PROMOTE_COMMAND;
 			else if (strcmp(argv[optind], "logrotate") == 0)
@@ -2559,6 +2661,9 @@ main(int argc, char **argv)
 		case RELOAD_COMMAND:
 			do_reload();
 			break;
+		case DEMOTE_COMMAND:
+			do_demote();
+			break;
 		case PROMOTE_COMMAND:
 			do_promote();
 			break;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..f529f8c7bd 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -87,6 +87,7 @@ typedef enum DBState
 	DB_STARTUP = 0,
 	DB_SHUTDOWNED,
 	DB_SHUTDOWNED_IN_RECOVERY,
+	DB_DEMOTING,
 	DB_SHUTDOWNING,
 	DB_IN_CRASH_RECOVERY,
 	DB_IN_ARCHIVE_RECOVERY,
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 179ebaa104..a9e27f009e 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -70,7 +70,12 @@ typedef struct
 
 typedef enum CAC_state
 {
-	CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
+	CAC_OK,
+	CAC_STARTUP,
+	CAC_DEMOTE,
+	CAC_SHUTDOWN,
+	CAC_RECOVERY,
+	CAC_TOOMANY,
 	CAC_WAITBACKUP
 } CAC_state;
 
diff --git a/src/include/utils/pidfile.h b/src/include/utils/pidfile.h
index 63fefe5c4c..f761d2c4ef 100644
--- a/src/include/utils/pidfile.h
+++ b/src/include/utils/pidfile.h
@@ -50,6 +50,7 @@
  */
 #define PM_STATUS_STARTING		"starting"	/* still starting up */
 #define PM_STATUS_STOPPING		"stopping"	/* in shutdown sequence */
+#define PM_STATUS_DEMOTING		"demoting"	/* demote sequence */
 #define PM_STATUS_READY			"ready   "	/* ready for connections */
 #define PM_STATUS_STANDBY		"standby "	/* up, won't accept connections */
 
-- 
2.20.1

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#1)
Re: [patch] demote

On Wed, Jun 17, 2020 at 11:45 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

> As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with future
> objectives similar to mine, I decided to share the humble patch I am playing with to
> step down an instance from primary to standby status.

Cool! This was vaguely on my hit list, but neither I nor any of my
colleagues had gotten the time and energy to have a go at it.

> The main difference from Amul's patch is that all backends must be disconnected to
> proceed with the demote. Either we wait for them to disconnect (smart) or we
> kill them (fast). This makes life much easier from the code point of view, but
> much more naive as well. E.g., calling "SELECT pg_demote('fast')" as an admin
> would kill the session, with no option to wait for the action to finish, as we
> do with pg_promote(). Keeping read-only sessions around could probably be
> achieved using a global barrier as Amul did, but without all the complexity
> related to prohibiting WAL writes.
>
> There are still some open questions in the current patch. As I wrote, it's a
> humble patch, a proof of concept, a bit naive.
>
> Is it worth discussing and improving further, or am I missing something
> obvious in this design that leads to a dead end?

I haven't looked at your code, but I think we should view the two
efforts as complementing each other, not competing. With both patches
in play, a clean switchover would look like this:

- first use ALTER SYSTEM READ ONLY (or whatever we decide to call it)
to make the primary read only, killing off write transactions
- next use pg_ctl promote to promote the standby
- finally use pg_ctl demote (or whatever we decide to call it) to turn
the read-only primary into a standby of the new primary

I think this would be waaaaay better than what you have to do today,
which as I mentioned in my reply to Tom on the other thread, is very
complicated and error-prone. I think with the combination of that
patch and this one (or some successor version of each) we could get to
a point where the tooling to do a clean switchover is relatively easy
to write and doesn't involve having to shut down the server completely
at any point. If we can do it while also preserving connections, at
least for read-only queries, that's a better user experience, but as
Tom pointed out over there, there are real concerns about the
complexity of these patches, so it may be that the approach you've
taken of just killing everything is safer and thus a superior choice
overall.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3 Andres Freund
andres@anarazel.de
In reply to: Jehan-Guillaume de Rorthais (#1)
Re: [patch] demote

Hi,

On 2020-06-17 17:44:51 +0200, Jehan-Guillaume de Rorthais wrote:

> As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with future
> objectives similar to mine, I decided to share the humble patch I am playing with to
> step down an instance from primary to standby status.

To make sure we are on the same page: What your patch intends to do is
to leave the server running, but switch from being a primary to
replicating from another system. Correct?

> The main difference from Amul's patch is that all backends must be disconnected to
> proceed with the demote. Either we wait for them to disconnect (smart) or we
> kill them (fast). This makes life much easier from the code point of view, but
> much more naive as well. E.g., calling "SELECT pg_demote('fast')" as an admin
> would kill the session, with no option to wait for the action to finish, as we
> do with pg_promote(). Keeping read-only sessions around could probably be
> achieved using a global barrier as Amul did, but without all the complexity
> related to prohibiting WAL writes.

FWIW just doing that for normal backends isn't sufficient, you also have
to deal with bgwriter, checkpointer, ... triggering WAL writes (FPWs due
to hint bits, the checkpoint record, and some more).

> There are still some open questions in the current patch. As I wrote, it's a
> humble patch, a proof of concept, a bit naive.
>
> Is it worth discussing and improving further, or am I missing something
> obvious in this design that leads to a dead end?

I don't think there's a fundamental issue, but I think it needs to deal
with a lot more things than it does right now. StartupXLOG doesn't
currently deal correctly with subsystems that are already
initialized. And your patch doesn't make it so as far as I can tell.

Greetings,

Andres Freund

In reply to: Robert Haas (#2)
Re: [patch] demote

On Wed, 17 Jun 2020 12:29:31 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

[...]

> > The main difference from Amul's patch is that all backends must be disconnected
> > to proceed with the demote. Either we wait for them to disconnect (smart)
> > or we kill them (fast). This makes life much easier from the code point of
> > view, but much more naive as well. E.g., calling "SELECT pg_demote('fast')"
> > as an admin would kill the session, with no option to wait for the action
> > to finish, as we do with pg_promote(). Keeping read-only sessions around
> > could probably be achieved using a global barrier as Amul did, but without
> > all the complexity related to prohibiting WAL writes.
> >
> > There are still some open questions in the current patch. As I wrote, it's a
> > humble patch, a proof of concept, a bit naive.
> >
> > Is it worth discussing and improving further, or am I missing something
> > obvious in this design that leads to a dead end?
>
> I haven't looked at your code, but I think we should view the two
> efforts as complementing each other, not competing.

That was part of my feeling. I like the idea of keeping readonly backends
around. But I'm not convinced by the "ALTER SYSTEM READ ONLY" feature on its
own.

At some cost, an admin can already make the system read-only from the
application's point of view, using:

alter system set default_transaction_read_only TO on;
select pg_reload_conf();

Current read-write transactions will finish, but no new ones will be allowed.

> With both patches in play, a clean switchover would look like this:
>
> - first use ALTER SYSTEM READ ONLY (or whatever we decide to call it)
> to make the primary read only, killing off write transactions
> - next use pg_ctl promote to promote the standby
> - finally use pg_ctl demote (or whatever we decide to call it) to turn
> the read-only primary into a standby of the new primary

I'm not sure how useful ALTER SYSTEM READ ONLY is outside of the switchover
scope. It seems like it should be part of the demote process itself. If we
focus on user experience, my original goal was:

* demote the primary
* promote a standby

Further down the road, with various additional patches (keep read-only backends, add
pg_demote(), etc.), we could extend the replication protocol so a switchover can
be negotiated and controlled from the nodes themselves.

> I think this would be waaaaay better than what you have to do today,
> which as I mentioned in my reply to Tom on the other thread, is very
> complicated and error-prone.

Well, I agree, the current procedure to achieve a clean switchover is
difficult from the user's point of view.

For the record, in PAF (a Pacemaker resource agent) we parse the pg_waldump
output to check whether the standby designated for promotion received the
shutdown checkpoint from the primary. If it did, we accept the promotion.

Manually, I usually shut down the primary, run a checkpoint on the standby,
compare the "REDO location" on both sides, then promote.
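
As a rough illustration of the manual comparison, the following standalone C sketch parses two REDO locations in the usual "hi/lo" hexadecimal LSN notation (as printed by pg_controldata) and checks whether they match. The function names are hypothetical, and treating equality as the "safe to promote" criterion is an assumption; the message above only says the two values are compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse an LSN written as "XXXXXXXX/XXXXXXXX" (two hexadecimal halves)
 * into a 64-bit value. Returns false if the text is malformed or has
 * trailing characters.
 */
static bool
parse_lsn(const char *text, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;
	char		extra;

	assert(text != NULL && lsn != NULL);

	/* Exactly two conversions must succeed; a third means trailing junk. */
	if (sscanf(text, "%X/%X%c", &hi, &lo, &extra) != 2)
		return false;

	*lsn = ((uint64_t) hi << 32) | lo;
	return true;
}

/*
 * Hypothetical check: true when both REDO locations parse and are equal,
 * i.e. the standby's last restartpoint matches the primary's shutdown
 * checkpoint.
 */
static bool
redo_locations_match(const char *primary_redo, const char *standby_redo)
{
	uint64_t	primary_lsn;
	uint64_t	standby_lsn;

	if (!parse_lsn(primary_redo, &primary_lsn) ||
		!parse_lsn(standby_redo, &standby_lsn))
		return false;

	return primary_lsn == standby_lsn;
}
```

Comparing the parsed 64-bit values rather than the raw strings avoids false mismatches caused by leading zeros or letter case.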

> I think with the combination of that
> patch and this one (or some successor version of each) we could get to
> a point where the tooling to do a clean switchover is relatively easy
> to write and doesn't involve having to shut down the server completely
> at any point.

That would be great yes.

> If we can do it while also preserving connections, at
> least for read-only queries, that's a better user experience, but as
> Tom pointed out over there, there are real concerns about the
> complexity of these patches, so it may be that the approach you've
> taken of just killing everything is safer and thus a superior choice
> overall.

As long as this approach doesn't close future doors to keeping read-only
backends around, that might be a good first step.

Thank you!

In reply to: Andres Freund (#3)
Re: [patch] demote

On Wed, 17 Jun 2020 11:14:47 -0700
Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-06-17 17:44:51 +0200, Jehan-Guillaume de Rorthais wrote:

As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar future
objectives to mine, I decided to share the humble patch I am playing with
to step down an instance from primary to standby status.

To make sure we are on the same page: What your patch intends to do is
to leave the server running, but switch from being a primary to
replicating from another system. Correct?

Yes. The instance status is moved back from "in production" to "in archive
recovery".

Of course, it will start replicating depending on archive_mode/command and
primary_conninfo setup.

Main difference with Amul's patch is that all backends must be disconnected
to proceed with the demote. Either we wait for them to disconnect (smart)
or we kill them (fast). This makes life much easier from the code point of
view, but much more naive as well. Eg. calling "SELECT pg_demote('fast')"
as an admin would kill the session, with no options to wait for the action
to finish, as we do with pg_promote(). Keeping read only session around
could probably be achieved using global barrier as Amul did, but without
all the complexity related to WAL writes prohibition.

FWIW just doing that for normal backends isn't sufficient, you also have
to deal with bgwriter, checkpointer, ... triggering WAL writes (FPWs due
to hint bits, the checkpoint record, and some more).

In fact, the patch relies on an existing code path in the state machine. The
startup process is started when the code enters the PM_NO_CHILDREN state. This
state is set when «These other guys should be dead already», as stated in the
code:

/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
pmState = PM_NO_CHILDREN;

There are still some open questions in the current patch. As I wrote, it's a
humble patch, a proof of concept, a bit naive.

Is it worth discussing and improving further, or am I missing something
obvious in this design that leads to a dead end?

I don't think there's a fundamental issue, but I think it needs to deal
with a lot more things than it does right now. StartupXLOG doesn't
currently deal correctly with subsystems that are already
initialized. And your patch doesn't make it so as far as I can tell.

If you are talking about bgwriter, checkpointer, etc, as far as I understand
the current state machine, my patch actually deals with them.

Thank you for your feedback!

I'll study how hard it would be to keep read only backends around during the
demote step.

Regards,

#6Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Robert Haas (#2)
Re: [patch] demote

On 2020/06/18 1:29, Robert Haas wrote:

On Wed, Jun 17, 2020 at 11:45 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar future
objectives to mine, I decided to share the humble patch I am playing with to
step down an instance from primary to standby status.

Cool! This was vaguely on my hit list, but neither I nor any of my
colleagues had gotten the time and energy to have a go at it.

Main difference with Amul's patch is that all backends must be disconnected to
proceed with the demote. Either we wait for them to disconnect (smart) or we
kill them (fast). This makes life much easier from the code point of view, but
much more naive as well. Eg. calling "SELECT pg_demote('fast')" as an admin
would kill the session, with no options to wait for the action to finish, as we
do with pg_promote(). Keeping read only session around could probably be
achieved using global barrier as Amul did, but without all the complexity
related to WAL writes prohibition.

There are still some open questions in the current patch. As I wrote, it's a
humble patch, a proof of concept, a bit naive.

Is it worth discussing and improving further, or am I missing something
obvious in this design that leads to a dead end?

I haven't looked at your code, but I think we should view the two
efforts as complementing each other, not competing. With both patches
in play, a clean switchover would look like this:

- first use ALTER SYSTEM READ ONLY (or whatever we decide to call it)
to make the primary read only, killing off write transactions
- next use pg_ctl promote to promote the standby
- finally use pg_ctl demote (or whatever we decide to call it) to turn
the read-only primary into a standby of the new primary

ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
What about the following procedure?

1. Demote the primary to a standby. Then this demoted standby is read-only.
2. The original standby automatically establishes the cascading replication
connection with the demoted standby.
3. Wait for all the WAL records available in the demoted standby to be streamed
to the original standby.
4. Promote the original standby to new primary.
5. Change primary_conninfo in the demoted standby so that it establishes
the replication connection with new primary.

So it seems enough to implement "demote" feature for a clean switchover.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#7Robert Haas
robertmhaas@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#4)
Re: [patch] demote

On Thu, Jun 18, 2020 at 6:02 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

At some expense, Admin can already set the system as readonly from the
application point of view, using:

alter system set default_transaction_read_only TO on;
select pg_reload_conf();

Current RW xact will finish, but no other will be allowed.

That doesn't block all WAL generation, though:

rhaas=# alter system set default_transaction_read_only TO on;
ALTER SYSTEM
rhaas=# select pg_reload_conf();
pg_reload_conf
----------------
t
(1 row)
rhaas=# cluster pgbench_accounts_pkey on pgbench_accounts;
rhaas=#

There's a bunch of other things it also doesn't block, too. If you're
trying to switch to a new primary, you really want to stop WAL
generation completely on the old one. Otherwise, you can't guarantee
that the machine you're going to promote is completely caught up,
which means you might lose some changes, and you might have to
pg_rewind the old master.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Robert Haas
robertmhaas@gmail.com
In reply to: Fujii Masao (#6)
Re: [patch] demote

On Thu, Jun 18, 2020 at 8:41 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
What about the following procedure?

1. Demote the primary to a standby. Then this demoted standby is read-only.
2. The original standby automatically establishes the cascading replication
connection with the demoted standby.
3. Wait for all the WAL records available in the demoted standby to be streamed
to the original standby.
4. Promote the original standby to new primary.
5. Change primary_conninfo in the demoted standby so that it establishes
the replication connection with new primary.

So it seems enough to implement "demote" feature for a clean switchover.

There's something to that idea. I think it somewhat depends on how
invasive the various operations are. For example, I'm not really sure
how feasible it is to demote without a full server restart that kicks
out all sessions. If that is required, it's a significant disadvantage
compared to ASRO. On the other hand, if a machine can be demoted just
by kicking out R/W sessions, as ASRO currently does, then maybe
there's not that much difference. Or maybe both designs are subject to
improvement and we can do something even less invasive...

One thing I think people are going to want to do is have the master go
read-only if it loses communication to the rest of the network, to
avoid or at least mitigate split-brain. However, such network
interruptions are often transient, so it might not be uncommon to
briefly go read-only due to a network blip, but then recover quickly
and return to a read-write state. It doesn't seem to matter much
whether that read-only state is a new kind of normal operation (like
what ASRO would do) or whether we've actually returned to a recovery
state (as demote would do) but the collateral effects of the state
change do matter.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In reply to: Robert Haas (#7)
Re: [patch] demote

On Thu, 18 Jun 2020 11:15:02 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 18, 2020 at 6:02 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

At some expense, Admin can already set the system as readonly from the
application point of view, using:

alter system set default_transaction_read_only TO on;
select pg_reload_conf();

Current RW xact will finish, but no other will be allowed.

That doesn't block all WAL generation, though:

rhaas=# alter system set default_transaction_read_only TO on;
ALTER SYSTEM
rhaas=# select pg_reload_conf();
pg_reload_conf
----------------
t
(1 row)
rhaas=# cluster pgbench_accounts_pkey on pgbench_accounts;
rhaas=#

Yes, this, and the fact that any user can easily switch transaction_read_only
back to off. This was a terrible example.

My point was that ALTER SYSTEM READ ONLY as described here doesn't feel like a
required user feature, outside of the demote scope. It might be useful for the
demote process, but only from the core point of view, without user interaction.
It seems there's no other purpose from the admin standpoint.

There's a bunch of other things it also doesn't block, too. If you're
trying to switch to a new primary, you really want to stop WAL
generation completely on the old one. Otherwise, you can't guarantee
that the machine you're going to promote is completely caught up,
which means you might lose some changes, and you might have to
pg_rewind the old master.

Yes, of course. I wasn't suggesting that transaction_read_only is useful in a
switchover procedure; sorry for the confusion and the misleading comment.

Regards,

In reply to: Robert Haas (#8)
Re: [patch] demote

On Thu, 18 Jun 2020 11:22:47 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 18, 2020 at 8:41 AM Fujii Masao <masao.fujii@oss.nttdata.com>
wrote:

ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
What about the following procedure?

1. Demote the primary to a standby. Then this demoted standby is read-only.
2. The original standby automatically establishes the cascading replication
connection with the demoted standby.
3. Wait for all the WAL records available in the demoted standby to be
streamed to the original standby.
4. Promote the original standby to new primary.
5. Change primary_conninfo in the demoted standby so that it establishes
the replication connection with new primary.

So it seems enough to implement "demote" feature for a clean switchover.

There's something to that idea. I think it somewhat depends on how
invasive the various operations are. For example, I'm not really sure
how feasible it is to demote without a full server restart that kicks
out all sessions. If that is required, it's a significant disadvantage
compared to ASRO. On the other hand, if a machine can be demoted just
by kicking out R/W sessions, as ASRO currently does, then maybe
there's not that much difference. Or maybe both designs are subject to
improvement and we can do something even less invasive...

Regarding improvements to the current demote patch, I was considering digging
in the following direction:

* add a new state in the state machine where all backends are idle
* this new state forbids any new writes, in the same fashion as on standby nodes
* this state could either wait for the end of xacts, or cancel/kill
RW backends, in the same fashion as the current smart/fast stops do
* from this state, we might then roll back pending prepared xacts, stop other
sub-processes etc (as the current patch does), and demote safely to
PM_RECOVERY or PM_HOT_STANDBY (depending on the setup).
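
To make the shape of that idea concrete, here is a toy model in C. It is only a sketch under heavy assumptions: every name in it (ToyState, ToyNode, toy_request_demote() and so on) is hypothetical, it tracks nothing but a count of R/W backends, and it ignores prepared xacts, auxiliary processes and shared memory entirely. It is not postmaster code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states, loosely named after the ones discussed above. */
typedef enum
{
	TOY_RUN,					/* normal read-write operation */
	TOY_DEMOTE_WAIT,			/* no new writes, waiting for R/W backends */
	TOY_RECOVERY,				/* demoted, no read-only connections */
	TOY_HOT_STANDBY				/* demoted, read-only connections allowed */
} ToyState;

typedef enum
{
	TOY_DEMOTE_SMART,			/* wait for R/W backends to finish */
	TOY_DEMOTE_FAST				/* cancel/kill R/W backends right away */
} ToyDemoteMode;

typedef struct
{
	ToyState	state;
	int			rw_backends;	/* backends still holding R/W xacts */
	int			hot_standby;	/* non-zero if the setup allows hot standby */
} ToyNode;

/* Leave the waiting state once no R/W backend remains. */
static void
toy_advance(ToyNode *node)
{
	if (node->state == TOY_DEMOTE_WAIT && node->rw_backends == 0)
		node->state = node->hot_standby ? TOY_HOT_STANDBY : TOY_RECOVERY;
}

/* Enter the "no new writes" state; fast mode drops R/W backends at once. */
void
toy_request_demote(ToyNode *node, ToyDemoteMode mode)
{
	assert(node != NULL);

	if (node->state != TOY_RUN)
		return;

	node->state = TOY_DEMOTE_WAIT;
	if (mode == TOY_DEMOTE_FAST)
		node->rw_backends = 0;
	toy_advance(node);
}

/* One R/W backend ended its transaction or disconnected. */
void
toy_backend_finished(ToyNode *node)
{
	assert(node != NULL);

	if (node->rw_backends > 0)
		node->rw_backends--;
	toy_advance(node);
}
```

In smart mode the node stays in the waiting state until the last R/W backend is done; in fast mode it goes straight to the demoted state.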

Is it something worth considering?
Maybe the code will be so close to ASRO that it would just be a kind of fusion
of both patches? I don't know, I didn't look at the ASRO patch yet.

One thing I think people are going to want to do is have the master go
read-only if it loses communication to the rest of the network, to
avoid or at least mitigate split-brain. However, such network
interruptions are often transient, so it might not be uncommon to
briefly go read-only due to a network blip, but then recover quickly
and return to a read-write state. It doesn't seem to matter much
whether that read-only state is a new kind of normal operation (like
what ASRO would do) or whether we've actually returned to a recovery
state (as demote would do) but the collateral effects of the state
change do matter.

Well, triggering such actions (demote or read only) is often an external
decision, hopefully relying on at least some quorum, and being able to escalate
to watchdog or fencing is required.

Most tools around will need to demote or fence. It seems dangerous to flip
between read only/read write on a bunch of cluster nodes. It might quickly get
messy, especially since a former primary with non-replicated data could
automatically replicate from a new primary without screaming...

Regards,

#11Robert Haas
robertmhaas@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#10)
Re: [patch] demote

On Thu, Jun 18, 2020 at 11:56 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

Regarding improvements to the current demote patch, I was considering digging
in the following direction:

* add a new state in the state machine where all backends are idle
* this new state forbids any new writes, in the same fashion as on standby nodes
* this state could either wait for the end of xacts, or cancel/kill
RW backends, in the same fashion as the current smart/fast stops do
* from this state, we might then roll back pending prepared xacts, stop other
sub-processes etc (as the current patch does), and demote safely to
PM_RECOVERY or PM_HOT_STANDBY (depending on the setup).

Is it something worth considering?
Maybe the code will be so close to ASRO that it would just be a kind of fusion
of both patches? I don't know, I didn't look at the ASRO patch yet.

I don't think that the postmaster state machine is the interesting
part of this problem. The tricky parts have to do with updating shared
memory state, and with updating per-backend private state. For
example, snapshots are taken in a different way during recovery than
they are in normal operation, hence SnapshotData's takenDuringRecovery
member. And I think that we allocate extra shared memory space for
storing the data that those snapshots use if, and only if, the server
starts up in recovery. So if the server goes backward from normal
running into recovery, we might not have the space that we need in
shared memory to store the extra data, and even if we had the space it
might not be populated correctly, and the code that takes snapshots
might not be written properly to handle multiple transitions between
recovery and normal running, or even a single backward transition.

In general, there's code scattered all throughout the system that
assumes the recovery -> normal running transition is one-way. If we go
back into recovery by killing off all backends and reinitializing
shared memory, then we don't have to worry about that stuff. If we do
anything less than that, we have to find all the code that relies on
never reentering recovery and fix it all. Now it's also true that we
have to do some other things, like restarting the startup process, and
stopping things like autovacuum, and the postmaster may need to be
involved in some of that. There's clearly some engineering work there,
but I think it's substantially less than the amount of engineering
work involved in fixing problems with shared memory contents and
backend-local state.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Robert Haas (#8)
Re: [patch] demote

On 2020/06/19 0:22, Robert Haas wrote:

On Thu, Jun 18, 2020 at 8:41 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
What about the following procedure?

1. Demote the primary to a standby. Then this demoted standby is read-only.
2. The original standby automatically establishes the cascading replication
connection with the demoted standby.
3. Wait for all the WAL records available in the demoted standby to be streamed
to the original standby.
4. Promote the original standby to new primary.
5. Change primary_conninfo in the demoted standby so that it establishes
the replication connection with new primary.

So it seems enough to implement "demote" feature for a clean switchover.

There's something to that idea. I think it somewhat depends on how
invasive the various operations are. For example, I'm not really sure
how feasible it is to demote without a full server restart that kicks
out all sessions. If that is required, it's a significant disadvantage
compared to ASRO.

Even with ASRO, the server restart is necessary and RO sessions are
kicked out when demoting RO primary to a standby, i.e., during a clean
switchover?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#13Andres Freund
andres@anarazel.de
In reply to: Jehan-Guillaume de Rorthais (#5)
Re: [patch] demote

Hi,

On 2020-06-18 12:16:27 +0200, Jehan-Guillaume de Rorthais wrote:

On Wed, 17 Jun 2020 11:14:47 -0700

I don't think there's a fundamental issue, but I think it needs to deal
with a lot more things than it does right now. StartupXLOG doesn't
currently deal correctly with subsystems that are already
initialized. And your patch doesn't make it so as far as I can tell.

If you are talking about bgwriter, checkpointer, etc, as far as I understand
the current state machine, my patch actually deals with them.

I'm talking about subsystems like subtrans multixact etc not being ok
with suddenly being initialized a second time. You cannot call
StartupXLOG twice without making modifications to it.

Greetings,

Andres Freund

#14Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#6)
Re: [patch] demote

Hi,

On 2020-06-18 21:41:45 +0900, Fujii Masao wrote:

ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
What about the following procedure?

1. Demote the primary to a standby. Then this demoted standby is read-only.

As far as I can tell this step includes ALTER SYSTEM READ ONLY. Sure you
can choose not to expose ASRO to the user separately from demote, but
that's a minuscule part of the complexity.

Greetings,

Andres Freund

#15Robert Haas
robertmhaas@gmail.com
In reply to: Fujii Masao (#12)
Re: [patch] demote

On Thu, Jun 18, 2020 at 12:55 PM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:

Even with ASRO, the server restart is necessary and RO sessions are
kicked out when demoting RO primary to a standby, i.e., during a clean
switchover?

The ASRO patch doesn't provide a way for a running server to be put
back into recovery, so yes, that is required, unless some other patch
fixes it so that it isn't. It would be better to find a way where
we never need to kill off R/O sessions at all, and I think that would
require all the machinery from the ASRO patch plus some more. If you
want to allow sessions to survive a state transition like this -
whether it's to a WAL-read-only state or all the way back to recovery
- you need a way to prevent further WAL writes in those sessions. Most
of the stuff that the ASRO patch does is concerned with that. So it
doesn't seem likely to me that we can just throw all that code away,
unless by chance somebody else has got a better version of the same
thing already. To go back to recovery rather than just to a read-only
state, I think you'd need to grapple with some additional issues that
patch doesn't touch, like some of the snapshot-taking stuff, but I
think you still need to solve all of the problems that it does deal
with, unless you're OK with killing every session.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#15)
Re: [patch] demote

Robert Haas <robertmhaas@gmail.com> writes:

... To go back to recovery rather than just to a read-only
state, I think you'd need to grapple with some additional issues that
patch doesn't touch, like some of the snapshot-taking stuff, but I
think you still need to solve all of the problems that it does deal
with, unless you're OK with killing every session.

It seems like this is the core decision that needs to be taken. If
we're willing to have these state transitions include a server restart,
then many things get simpler. If we're not, it's gonna cost us in
code complexity and hence bugs. Maybe the usability gain is worth it,
or maybe not.

I think it would probably be worth the trouble to pursue both designs in
parallel for awhile, so we can get a better handle on exactly how much
complexity we're buying into with the more ambitious definition.

regards, tom lane

#17Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#16)
Re: [patch] demote

Hi,

On 2020-06-18 13:24:38 -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

... To go back to recovery rather than just to a read-only
state, I think you'd need to grapple with some additional issues that
patch doesn't touch, like some of the snapshot-taking stuff, but I
think you still need to solve all of the problems that it does deal
with, unless you're OK with killing every session.

It seems like this is the core decision that needs to be taken. If
we're willing to have these state transitions include a server restart,
then many things get simpler. If we're not, it's gonna cost us in
code complexity and hence bugs. Maybe the usability gain is worth it,
or maybe not.

I think it would probably be worth the trouble to pursue both designs in
parallel for awhile, so we can get a better handle on exactly how much
complexity we're buying into with the more ambitious definition.

What I like about ALTER SYSTEM READ ONLY is that it basically would
likely be a part of both a restart and a non-restart based
implementation.

I don't really get why the demote in this thread is mentioned as an
alternative - it pretty obviously has to include a large portion of
ALTER SYSTEM READ ONLY.

The only part that could really be skipped by going straight to demote
is a way to make ASRO invocable directly. You can simplify a bit more by
killing all user sessions, but at that point there's not that much
upside to having a no-restart version of demote in the first place.

The demote patch in this thread doesn't even start to attack much of the
real complexity around turning a primary into a standby.

To me the complexity of a restartless demotion is likely worth it. But
it doesn't seem feasible to get there in one large step. So adding
individually usable sub-steps like ASRO makes sense imo.

Greetings,

Andres Freund

#18Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#16)
Re: [patch] demote

On Thu, Jun 18, 2020 at 1:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

It seems like this is the core decision that needs to be taken. If
we're willing to have these state transitions include a server restart,
then many things get simpler. If we're not, it's gonna cost us in
code complexity and hence bugs. Maybe the usability gain is worth it,
or maybe not.

I think it would probably be worth the trouble to pursue both designs in
parallel for awhile, so we can get a better handle on exactly how much
complexity we're buying into with the more ambitious definition.

I wouldn't vote to reject a patch that performed a demotion by doing a
full server restart, because it's a useful incremental step, but I
wouldn't be excited about spending a lot of time on it, either,
because it's basically crippleware by design. Either you have to
checkpoint before restarting, or you have to run recovery after
restarting, and either of those can be extremely slow. You also break
all the connections, which can produce application errors unless the
applications are pretty robustly designed, and you lose the entire
contents of shared_buffers, which makes things run very slowly even
after the restart is completed, which can cause a lengthy slow period
even after the system is nominally back up. All of those things are
really bad, and AFAICT the first one is the worst by a considerable
margin. It can take 20 minutes to checkpoint and even longer to run
recovery, so doing this once per year already puts you outside of
five-nines territory, and we release critical updates that cannot be
applied without a server restart about four times per year. That means
that if you perform updates by using a switchover -- a common practice
-- and if you apply all of your updates in a timely fashion --
unfortunately, not such a common practice, but one we'd surely like to
encourage -- you can't even achieve four nines if a switchover
requires either a checkpoint or running recovery. And that's without
accounting for any switchovers that you may need to perform for
reasons unrelated to upgrades, and without accounting for any
unplanned downtime. Not many people in 2020 are interested in running
a system with three nines of availability, so I think it is clear that
we need to do better.
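
The arithmetic behind those numbers can be sketched in a few lines of C. This is only an illustration: the function names are hypothetical, a year is taken as 365.25 days, and the only downtime counted is the planned switchovers themselves.

```c
#include <assert.h>

/* Minutes in an average year, assuming 365.25 days. */
#define MINUTES_PER_YEAR (365.25 * 24.0 * 60.0)

/*
 * Yearly downtime budget, in minutes, for an availability target
 * expressed as a number of nines (3 means 99.9%, 5 means 99.999%).
 */
double
downtime_budget_minutes(int nines)
{
	double		unavailable_fraction = 1.0;
	int			i;

	assert(nines >= 1);

	for (i = 0; i < nines; i++)
		unavailable_fraction /= 10.0;

	return MINUTES_PER_YEAR * unavailable_fraction;
}

/*
 * Highest number of nines (capped at 9) still met when each of the
 * given number of yearly switchovers costs minutes_each of downtime.
 * Returns 0 if even one nine is out of reach.
 */
int
achievable_nines(int switchovers_per_year, double minutes_each)
{
	double		downtime = switchovers_per_year * minutes_each;
	int			nines = 0;

	assert(switchovers_per_year >= 0 && minutes_each >= 0.0);

	while (nines < 9 && downtime <= downtime_budget_minutes(nines + 1))
		nines++;

	return nines;
}
```

Five nines leaves a budget of about 5.26 minutes per year and four nines about 52.6 minutes, so achievable_nines(1, 20.0) returns 4 and achievable_nines(4, 20.0) returns 3, which matches the figures given above.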

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In reply to: Jehan-Guillaume de Rorthais (#1)
1 attachment(s)
Re: [patch] demote

Hi,

Here is a summary of my work during the last few days on this demote approach.

Please find attached v2-0001-Demote-PoC.patch, with comments in the
commit message and as FIXME in the code.

The patch is not finished or bug-free yet. I'm still not very happy with the
coding style and it probably lacks some code documentation, but a lot has
changed since v1. It's still a PoC to push the discussion a bit further after
being silent myself for some days.

The patch currently relies on a demote checkpoint. I understand that the
overhead of a forced checkpoint can be massive and cause major wait/downtime.
But I am keeping this for a later step. Maybe we should be able to cancel a
running checkpoint? Or leave it to its syncing work but discard the result
without writing it to XLog?

I haven't had time to investigate Robert's concern about shared memory for
snapshots during recovery.

The patch doesn't deal with prepared xacts yet. Testing "start->demote->promote"
raises an assertion failure if any prepared xact exists. I suppose I will roll
them back during demote in the next patch version.

I'm not sure how to divide this patch into multiple small independent steps. I
suppose I can split it like:

1. add demote checkpoint
2. support demote: mostly postmaster, startup/xlog and checkpointer related
code
3. cli using pg_ctl demote

...But I'm not sure it's worth it.

Regards,

Attachments:

v2-0001-Demote-PoC.patch (text/x-patch)
From 03c41dd706648cd20df90a128db64eee6b6dad97 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Fri, 10 Apr 2020 18:01:45 +0200
Subject: [PATCH] Demote PoC

Changes:

* creates a demote checkpoint
* use DB_DEMOTING state in controlfile
* try to handle subsystems init correctly during demote
* keep some sub-processes alive:
  stat collector, checkpointer, bgwriter and optionally archiver or wal
  senders
* add signal PMSIGNAL_DEMOTING to start the startup process after the
  demote checkpoint
* ShutdownXLOG takes a boolean arg to handle demote differently

Trivial manual tests:

* make check : OK
* make check-world : OK
* start in production -> demote -> demote: OK
* start in production -> demote -> stop : OK
* start in production -> demote -> promote : NOK (2PC, see TODO)
  but OK with no prepared xact.

Discuss/Todo:

* rollback prepared xact
* cancel/kill active/idle in xact R/W backends
  * pg_demote() function?
* some more code reviewing around StartupXlog
* investigate snapshots shmem needs/init during recovery compare to
  production
* add tap tests
* add doc
* how to handle checkpoint?
---
 src/backend/access/rmgrdesc/xlogdesc.c  |   9 +-
 src/backend/access/transam/xlog.c       | 287 +++++++++++++++---------
 src/backend/postmaster/checkpointer.c   |  22 ++
 src/backend/postmaster/postmaster.c     | 250 ++++++++++++++++-----
 src/backend/storage/ipc/procsignal.c    |   4 +
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/bin/pg_ctl/pg_ctl.c                 | 111 +++++++++
 src/include/access/xlog.h               |  18 +-
 src/include/catalog/pg_control.h        |   2 +
 src/include/libpq/libpq-be.h            |   7 +-
 src/include/postmaster/bgwriter.h       |   1 +
 src/include/storage/pmsignal.h          |   1 +
 src/include/storage/procsignal.h        |   1 +
 src/include/utils/pidfile.h             |   1 +
 14 files changed, 537 insertions(+), 179 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 1cd97852e8..5aeaff18f8 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -40,7 +40,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
 	if (info == XLOG_CHECKPOINT_SHUTDOWN ||
-		info == XLOG_CHECKPOINT_ONLINE)
+		info == XLOG_CHECKPOINT_ONLINE ||
+		info == XLOG_CHECKPOINT_DEMOTE)
 	{
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
@@ -65,7 +66,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 checkpoint->oldestCommitTsXid,
 						 checkpoint->newestCommitTsXid,
 						 checkpoint->oldestActiveXid,
-						 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
+						 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" :
+							(info == XLOG_CHECKPOINT_DEMOTE)? "demote" : "online");
 	}
 	else if (info == XLOG_NEXTOID)
 	{
@@ -185,6 +187,9 @@ xlog_identify(uint8 info)
 		case XLOG_FPI_FOR_HINT:
 			id = "FPI_FOR_HINT";
 			break;
+		case XLOG_CHECKPOINT_DEMOTE:
+			id = "CHECKPOINT_DEMOTE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e455384b5b..0e18e546ba 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6301,6 +6301,13 @@ CheckRequiredParameterValues(void)
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
+/*
+ * FIXME demote: part of the code here assume there's no other active
+ * processes before signal PMSIGNAL_RECOVERY_STARTED is sent.
+ *
+ * FIXME demote: rollback prepared xact during demote?
+ */
+
 void
 StartupXLOG(void)
 {
@@ -6324,6 +6331,7 @@ StartupXLOG(void)
 	XLogPageReadPrivate private;
 	bool		fast_promoted = false;
 	struct stat st;
+	bool		is_demoting = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6388,6 +6396,16 @@ StartupXLOG(void)
 							str_time(ControlFile->time))));
 			break;
 
+		case DB_DEMOTING:
+			ereport(LOG,
+					(errmsg("database system was demoted at %s",
+							str_time(ControlFile->time))));
+			is_demoting = true;
+			bgwriterLaunched = true;
+			InArchiveRecovery = true;
+			StandbyMode = true;
+			break;
+
 		default:
 			ereport(FATAL,
 					(errmsg("control file contains invalid database cluster state")));
@@ -6421,7 +6439,8 @@ StartupXLOG(void)
 	 *   persisted.  To avoid that, fsync the entire data directory.
 	 */
 	if (ControlFile->state != DB_SHUTDOWNED &&
-		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
+		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY &&
+		ControlFile->state != DB_DEMOTING)
 	{
 		RemoveTempXlogFiles();
 		SyncDataDirectory();
@@ -6678,6 +6697,9 @@ StartupXLOG(void)
 		}
 		memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
 		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
+
+		if (is_demoting)
+			Assert((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_DEMOTE);
 	}
 
 	/*
@@ -6739,9 +6761,9 @@ StartupXLOG(void)
 	LastRec = RecPtr = checkPointLoc;
 
 	ereport(DEBUG1,
-			(errmsg_internal("redo record is at %X/%X; shutdown %s",
+			(errmsg_internal("redo record is at %X/%X; %s checkpoint",
 							 (uint32) (checkPoint.redo >> 32), (uint32) checkPoint.redo,
-							 wasShutdown ? "true" : "false")));
+							 wasShutdown ? "shutdown" : is_demoting ? "demote" : "")));
 	ereport(DEBUG1,
 			(errmsg_internal("next transaction ID: " UINT64_FORMAT "; next OID: %u",
 							 U64FromFullTransactionId(checkPoint.nextFullXid),
@@ -6775,47 +6797,74 @@ StartupXLOG(void)
 					 checkPoint.newestCommitTsXid);
 	XLogCtl->ckptFullXid = checkPoint.nextFullXid;
 
-	/*
-	 * Initialize replication slots, before there's a chance to remove
-	 * required resources.
-	 */
-	StartupReplicationSlots();
+	if (!is_demoting)
+	{
+		/*
+		 * Initialize replication slots, before there's a chance to remove
+		 * required resources.
+		 */
+		StartupReplicationSlots();
 
-	/*
-	 * Startup logical state, needs to be setup now so we have proper data
-	 * during crash recovery.
-	 */
-	StartupReorderBuffer();
+		/*
+		 * Startup logical state, needs to be setup now so we have proper data
+		 * during crash recovery.
+		 */
+		StartupReorderBuffer();
 
-	/*
-	 * Startup MultiXact. We need to do this early to be able to replay
-	 * truncations.
-	 */
-	StartupMultiXact();
+		/*
+		 * Startup MultiXact. We need to do this early to be able to replay
+		 * truncations.
+		 */
+		StartupMultiXact();
 
-	/*
-	 * Ditto for commit timestamps.  Activate the facility if the setting is
-	 * enabled in the control file, as there should be no tracking of commit
-	 * timestamps done when the setting was disabled.  This facility can be
-	 * started or stopped when replaying a XLOG_PARAMETER_CHANGE record.
-	 */
-	if (ControlFile->track_commit_timestamp)
-		StartupCommitTs();
+		/*
+		 * Ditto for commit timestamps.  Activate the facility if the setting is
+		 * enabled in the control file, as there should be no tracking of commit
+		 * timestamps done when the setting was disabled.  This facility can be
+		 * started or stopped when replaying a XLOG_PARAMETER_CHANGE record.
+		 */
+		if (ControlFile->track_commit_timestamp)
+			StartupCommitTs();
 
-	/*
-	 * Recover knowledge about replay progress of known replication partners.
-	 */
-	StartupReplicationOrigin();
+		/*
+		 * Recover knowledge about replay progress of known replication partners.
+		 */
+		StartupReplicationOrigin();
 
-	/*
-	 * Initialize unlogged LSN. On a clean shutdown, it's restored from the
-	 * control file. On recovery, all unlogged relations are blown away, so
-	 * the unlogged LSN counter can be reset too.
-	 */
-	if (ControlFile->state == DB_SHUTDOWNED)
-		XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
-	else
-		XLogCtl->unloggedLSN = FirstNormalUnloggedLSN;
+		/*
+		 * Initialize unlogged LSN. On a clean shutdown, it's restored from the
+		 * control file. On recovery, all unlogged relations are blown away, so
+		 * the unlogged LSN counter can be reset too.
+		 */
+		if (ControlFile->state == DB_SHUTDOWNED)
+			XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
+		else
+			XLogCtl->unloggedLSN = FirstNormalUnloggedLSN;
+
+		/*
+		 * Copy any missing timeline history files between 'now' and the recovery
+		 * target timeline from archive to pg_wal. While we don't need those files
+		 * ourselves - the history file of the recovery target timeline covers all
+		 * the previous timelines in the history too - a cascading standby server
+		 * might be interested in them. Or, if you archive the WAL from this
+		 * server to a different archive than the master, it'd be good for all the
+		 * history files to get archived there after failover, so that you can use
+		 * one of the old timelines as a PITR target. Timeline history files are
+		 * small, so it's better to copy them unnecessarily than not copy them and
+		 * regret later.
+		 */
+		restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI);
+
+		/*
+		 * Before running in recovery, scan pg_twophase and fill in its status to
+		 * be able to work on entries generated by redo.  Doing a scan before
+		 * taking any recovery action has the merit to discard any 2PC files that
+		 * are newer than the first record to replay, saving from any conflicts at
+		 * replay.  This avoids as well any subsequent scans when doing recovery
+		 * of the on-disk two-phase data.
+		 */
+		restoreTwoPhaseData();
+	}
 
 	/*
 	 * We must replay WAL entries using the same TimeLineID they were created
@@ -6824,30 +6873,6 @@ StartupXLOG(void)
 	 */
 	ThisTimeLineID = checkPoint.ThisTimeLineID;
 
-	/*
-	 * Copy any missing timeline history files between 'now' and the recovery
-	 * target timeline from archive to pg_wal. While we don't need those files
-	 * ourselves - the history file of the recovery target timeline covers all
-	 * the previous timelines in the history too - a cascading standby server
-	 * might be interested in them. Or, if you archive the WAL from this
-	 * server to a different archive than the master, it'd be good for all the
-	 * history files to get archived there after failover, so that you can use
-	 * one of the old timelines as a PITR target. Timeline history files are
-	 * small, so it's better to copy them unnecessarily than not copy them and
-	 * regret later.
-	 */
-	restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI);
-
-	/*
-	 * Before running in recovery, scan pg_twophase and fill in its status to
-	 * be able to work on entries generated by redo.  Doing a scan before
-	 * taking any recovery action has the merit to discard any 2PC files that
-	 * are newer than the first record to replay, saving from any conflicts at
-	 * replay.  This avoids as well any subsequent scans when doing recovery
-	 * of the on-disk two-phase data.
-	 */
-	restoreTwoPhaseData();
-
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
 	RedoRecPtr = XLogCtl->RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
@@ -6985,7 +7010,8 @@ StartupXLOG(void)
 		/*
 		 * Reset pgstat data, because it may be invalid after recovery.
 		 */
-		pgstat_reset_all();
+		if (!is_demoting)
+			pgstat_reset_all();
 
 		/*
 		 * If there was a backup label file, it's done its job and the info
@@ -7061,8 +7087,11 @@ StartupXLOG(void)
 			 * timestamp have already been started up and other SLRUs are not
 			 * maintained during recovery and need not be started yet.
 			 */
-			StartupCLOG();
-			StartupSUBTRANS(oldestActiveXID);
+			if (!is_demoting)
+			{
+				StartupCLOG();
+				StartupSUBTRANS(oldestActiveXID);
+			}
 
 			/*
 			 * If we're beginning at a shutdown checkpoint, we know that
@@ -7070,7 +7099,7 @@ StartupXLOG(void)
 			 * empty running-xacts record and use that here and now. Recover
 			 * additional standby state for prepared transactions.
 			 */
-			if (wasShutdown)
+			if (wasShutdown || is_demoting)
 			{
 				RunningTransactionsData running;
 				TransactionId latestCompletedXid;
@@ -7093,9 +7122,10 @@ StartupXLOG(void)
 				running.xids = xids;
 
 				ProcArrayApplyRecoveryInfo(&running);
+			}
 
+			if (wasShutdown)
 				StandbyRecoverPreparedTransactions();
-			}
 		}
 
 		/* Initialize resource managers */
@@ -7941,6 +7971,7 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	XLogCtl->SharedHotStandbyActive = false;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -8292,7 +8323,8 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 	}
 	info = record->xl_info & ~XLR_INFO_MASK;
 	if (info != XLOG_CHECKPOINT_SHUTDOWN &&
-		info != XLOG_CHECKPOINT_ONLINE)
+		info != XLOG_CHECKPOINT_ONLINE &&
+		info != XLOG_CHECKPOINT_DEMOTE)
 	{
 		switch (whichChkpt)
 		{
@@ -8486,6 +8518,8 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
 void
 ShutdownXLOG(int code, Datum arg)
 {
+	bool isDemoting = DatumGetBool(arg);
+
 	/*
 	 * We should have an aux process resource owner to use, and we should not
 	 * be in a transaction that's installed some other resowner.
@@ -8495,36 +8529,56 @@ ShutdownXLOG(int code, Datum arg)
 		   CurrentResourceOwner == AuxProcessResourceOwner);
 	CurrentResourceOwner = AuxProcessResourceOwner;
 
-	/* Don't be chatty in standalone mode */
-	ereport(IsPostmasterEnvironment ? LOG : NOTICE,
-			(errmsg("shutting down")));
-
-	/*
-	 * Signal walsenders to move to stopping state.
-	 */
-	WalSndInitStopping();
-
-	/*
-	 * Wait for WAL senders to be in stopping state.  This prevents commands
-	 * from writing new WAL.
-	 */
-	WalSndWaitStopping();
+	if (isDemoting)
+	{
+		/* Don't be chatty in standalone mode */
+		ereport(IsPostmasterEnvironment ? LOG : NOTICE,
+				(errmsg("demoting")));
 
-	if (RecoveryInProgress())
-		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		/*
+		 * FIXME demote: avoiding checkpoint?
+		 * A checkpoint is probably running during a demote action. If
+		 * we don't want to wait for the checkpoint during the demote,
+		 * we might need to cancel it as it will not be able to write
+		 * to the WAL after the demote.
+		 */
+		CreateCheckPoint(CHECKPOINT_IS_DEMOTE | CHECKPOINT_IMMEDIATE);
+		LocalRecoveryInProgress = true;
+	}
 	else
 	{
+		/* Don't be chatty in standalone mode */
+		ereport(IsPostmasterEnvironment ? LOG : NOTICE,
+				(errmsg("shutting down")));
+
 		/*
-		 * If archiving is enabled, rotate the last XLOG file so that all the
-		 * remaining records are archived (postmaster wakes up the archiver
-		 * process one more time at the end of shutdown). The checkpoint
-		 * record will go to the next XLOG file and won't be archived (yet).
+		 * Signal walsenders to move to stopping state.
 		 */
-		if (XLogArchivingActive() && XLogArchiveCommandSet())
-			RequestXLogSwitch(false);
+		WalSndInitStopping();
+
+		/*
+		 * Wait for WAL senders to be in stopping state.  This prevents commands
+		 * from writing new WAL.
+		 */
+		WalSndWaitStopping();
 
-		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		if (RecoveryInProgress())
+			CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		else
+		{
+			/*
+			 * If archiving is enabled, rotate the last XLOG file so that all the
+			 * remaining records are archived (postmaster wakes up the archiver
+			 * process one more time at the end of shutdown). The checkpoint
+			 * record will go to the next XLOG file and won't be archived (yet).
+			 */
+			if (XLogArchivingActive() && XLogArchiveCommandSet())
+				RequestXLogSwitch(false);
+
+			CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		}
 	}
+
 	ShutdownCLOG();
 	ShutdownCommitTs();
 	ShutdownSUBTRANS();
@@ -8537,9 +8591,10 @@ ShutdownXLOG(int code, Datum arg)
 static void
 LogCheckpointStart(int flags, bool restartpoint)
 {
-	elog(LOG, "%s starting:%s%s%s%s%s%s%s%s",
+	elog(LOG, "%s starting:%s%s%s%s%s%s%s%s%s",
 		 restartpoint ? "restartpoint" : "checkpoint",
 		 (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
+		 (flags & CHECKPOINT_IS_DEMOTE) ? " demote" : "",
 		 (flags & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "",
 		 (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
 		 (flags & CHECKPOINT_FORCE) ? " force" : "",
@@ -8675,6 +8730,7 @@ UpdateCheckPointDistanceEstimate(uint64 nbytes)
  *
  * flags is a bitwise OR of the following:
  *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
+ *	CHECKPOINT_IS_DEMOTE: checkpoint is for demote.
  *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
  *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *		ignoring checkpoint_completion_target parameter.
@@ -8703,6 +8759,7 @@ void
 CreateCheckPoint(int flags)
 {
 	bool		shutdown;
+	bool		demote;
 	CheckPoint	checkPoint;
 	XLogRecPtr	recptr;
 	XLogSegNo	_logSegNo;
@@ -8723,6 +8780,14 @@ CreateCheckPoint(int flags)
 	else
 		shutdown = false;
 
+	/*
+	 * A demote checkpoint is a kind of shutdown checkpoint as well
+	 */
+	if (flags & CHECKPOINT_IS_DEMOTE)
+		demote = true;
+	else
+		demote = false;
+
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
@@ -8760,10 +8825,10 @@ CreateCheckPoint(int flags)
 	 */
 	START_CRIT_SECTION();
 
-	if (shutdown)
+	if (shutdown || demote)
 	{
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-		ControlFile->state = DB_SHUTDOWNING;
+		ControlFile->state = demote ? DB_DEMOTING : DB_SHUTDOWNING;
 		ControlFile->time = (pg_time_t) time(NULL);
 		UpdateControlFile();
 		LWLockRelease(ControlFileLock);
@@ -8809,7 +8874,7 @@ CreateCheckPoint(int flags)
 	 * avoid inserting duplicate checkpoints when the system is idle.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
-				  CHECKPOINT_FORCE)) == 0)
+				  CHECKPOINT_IS_DEMOTE | CHECKPOINT_FORCE)) == 0)
 	{
 		if (last_important_lsn == ControlFile->checkPoint)
 		{
@@ -8980,7 +9045,7 @@ CreateCheckPoint(int flags)
 	 * If we are shutting down, or Startup process is completing crash
 	 * recovery we don't need to write running xact data.
 	 */
-	if (!shutdown && XLogStandbyInfoActive())
+	if (!(shutdown || demote) && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
 	START_CRIT_SECTION();
@@ -8990,20 +9055,23 @@ CreateCheckPoint(int flags)
 	 */
 	XLogBeginInsert();
 	XLogRegisterData((char *) (&checkPoint), sizeof(checkPoint));
-	recptr = XLogInsert(RM_XLOG_ID,
-						shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
-						XLOG_CHECKPOINT_ONLINE);
+	if (demote)
+		recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_DEMOTE);
+	else if (shutdown)
+		recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_SHUTDOWN);
+	else
+		recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_ONLINE);
 
 	XLogFlush(recptr);
 
 	/*
-	 * We mustn't write any new WAL after a shutdown checkpoint, or it will be
-	 * overwritten at next startup.  No-one should even try, this just allows
-	 * sanity-checking.  In the case of an end-of-recovery checkpoint, we want
-	 * to just temporarily disable writing until the system has exited
-	 * recovery.
+	 * We mustn't write any new WAL after a shutdown or demote checkpoint, or
+	 * it will be overwritten at next startup.  No-one should even try, this
+	 * just allows sanity-checking.  In the case of an end-of-recovery
+	 * checkpoint, we want to just temporarily disable writing until the system
+	 * has exited recovery.
 	 */
-	if (shutdown)
+	if (shutdown || demote)
 	{
 		if (flags & CHECKPOINT_END_OF_RECOVERY)
 			LocalXLogInsertAllowed = -1;	/* return to "check" state */
@@ -9015,9 +9083,10 @@ CreateCheckPoint(int flags)
 	 * We now have ProcLastRecPtr = start of actual checkpoint record, recptr
 	 * = end of actual checkpoint record.
 	 */
-	if (shutdown && checkPoint.redo != ProcLastRecPtr)
+	if ((shutdown || demote) && checkPoint.redo != ProcLastRecPtr)
 		ereport(PANIC,
-				(errmsg("concurrent write-ahead log activity while database system is shutting down")));
+				(errmsg("concurrent write-ahead log activity while database system is %s",
+						shutdown ? "shutting down" : "demoting")));
 
 	/*
 	 * Remember the prior checkpoint's redo ptr for
@@ -9087,7 +9156,7 @@ CreateCheckPoint(int flags)
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
-	if (!shutdown)
+	if (!(shutdown || demote))
 		PreallocXlogFiles(recptr);
 
 	/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 624a3238b8..cf8ea2a601 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -52,6 +52,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/shmem.h"
@@ -151,6 +152,7 @@ double		CheckPointCompletionTarget = 0.5;
  * Private state
  */
 static bool ckpt_active = false;
+static volatile sig_atomic_t demoteRequestPending = false;
 
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
@@ -552,6 +554,14 @@ HandleCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
+	if (demoteRequestPending)
+	{
+		demoteRequestPending = false;
+		/* Close down the database */
+		ShutdownXLOG(0, BoolGetDatum(true));
+		SendPostmasterSignal(PMSIGNAL_DEMOTING);
+		/* no need to exit the checkpointer during demote */
+	}
 	if (ShutdownRequestPending)
 	{
 		/*
@@ -680,6 +690,7 @@ CheckpointWriteDelay(int flags, double progress)
 	 * in which case we just try to catch up as quickly as possible.
 	 */
 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!demoteRequestPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
 		IsCheckpointOnSchedule(progress))
@@ -812,6 +823,17 @@ IsCheckpointOnSchedule(double progress)
  * --------------------------------
  */
 
+/* SIGUSR1: set flag to demote */
+void
+ReqCheckpointDemoteHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	demoteRequestPending = true;
+
+	errno = save_errno;
+}
+
 /* SIGINT: set flag to run a normal checkpoint right away */
 static void
 ReqCheckpointHandler(SIGNAL_ARGS)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b4d475bb0b..d5cc63f697 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -150,6 +150,9 @@
 
 #define BACKEND_TYPE_WORKER		(BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
+/* file to signal demotion from primary to standby */
+#define DEMOTE_SIGNAL_FILE		"demote"
+
 /*
  * List of active backends (or child processes anyway; we don't actually
  * know whether a given child has become a backend or is still in the
@@ -269,18 +272,23 @@ typedef enum
 static StartupStatusEnum StartupStatus = STARTUP_NOT_RUNNING;
 
 /* Startup/shutdown state */
-#define			NoShutdown		0
-#define			SmartShutdown	1
-#define			FastShutdown	2
-#define			ImmediateShutdown	3
-
-static int	Shutdown = NoShutdown;
+typedef enum StepDownState {
+	NoShutdown = 0, /* find better label? */
+	SmartShutdown,
+	SmartDemote,
+	FastShutdown,
+	FastDemote,
+	ImmediateShutdown
+} StepDownState;
+
+static StepDownState StepDown = NoShutdown;
+static bool DemoteSignal = false; /* true on demote request */
 
 static bool FatalError = false; /* T if recovering from backend crash */
 
 /*
- * We use a simple state machine to control startup, shutdown, and
- * crash recovery (which is rather like shutdown followed by startup).
+ * We use a simple state machine to control startup, shutdown, demote and
+ * crash recovery (the latter two are rather like shutdown followed by startup).
  *
  * After doing all the postmaster initialization work, we enter PM_STARTUP
  * state and the startup process is launched. The startup process begins by
@@ -314,7 +322,7 @@ static bool FatalError = false; /* T if recovering from backend crash */
  * will not be very long).
  *
  * Notice that this state variable does not distinguish *why* we entered
- * states later than PM_RUN --- Shutdown and FatalError must be consulted
+ * states later than PM_RUN --- StepDown and FatalError must be consulted
  * to find that out.  FatalError is never true in PM_RECOVERY_* or PM_RUN
  * states, nor in PM_SHUTDOWN states (because we don't enter those states
  * when trying to recover from a crash).  It can be true in PM_STARTUP state,
@@ -414,6 +422,8 @@ static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
 static bool SignalSomeChildren(int signal, int targets);
 static void TerminateChildren(int signal);
+static bool CheckDemoteSignal(void);
+
 
 #define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
 
@@ -1550,7 +1560,7 @@ DetermineSleepTime(struct timeval *timeout)
 	 * Normal case: either there are no background workers at all, or we're in
 	 * a shutdown sequence (during which we ignore bgworkers altogether).
 	 */
-	if (Shutdown > NoShutdown ||
+	if (StepDown > NoShutdown ||
 		(!StartWorkerNeeded && !HaveCrashedWorker))
 	{
 		if (AbortStartTime != 0)
@@ -1830,7 +1840,7 @@ ServerLoop(void)
 		 *
 		 * Note we also do this during recovery from a process crash.
 		 */
-		if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
+		if ((StepDown >= ImmediateShutdown || (FatalError && !SendStop)) &&
 			AbortStartTime != 0 &&
 			(now - AbortStartTime) >= SIGKILL_CHILDREN_AFTER_SECS)
 		{
@@ -2305,6 +2315,11 @@ retry1:
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
 					 errmsg("the database system is starting up")));
 			break;
+		case CAC_DEMOTE:
+			ereport(FATAL,
+					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+					 errmsg("the database system is demoting")));
+			break;
 		case CAC_SHUTDOWN:
 			ereport(FATAL,
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
@@ -2436,7 +2451,7 @@ canAcceptConnections(int backend_type)
 	CAC_state	result = CAC_OK;
 
 	/*
-	 * Can't start backends when in startup/shutdown/inconsistent recovery
+	 * Can't start backends when in startup/demote/shutdown/inconsistent recovery
 	 * state.  We treat autovac workers the same as user backends for this
 	 * purpose.  However, bgworkers are excluded from this test; we expect
 	 * bgworker_should_start_now() decided whether the DB state allows them.
@@ -2452,7 +2467,9 @@ canAcceptConnections(int backend_type)
 	{
 		if (pmState == PM_WAIT_BACKUP)
 			result = CAC_WAITBACKUP;	/* allow superusers only */
-		else if (Shutdown > NoShutdown)
+		else if (StepDown == SmartDemote || StepDown == FastDemote)
+			return CAC_DEMOTE;	/* demote is pending */
+		else if (StepDown > NoShutdown)
 			return CAC_SHUTDOWN;	/* shutdown is pending */
 		else if (!FatalError &&
 				 (pmState == PM_STARTUP ||
@@ -2683,7 +2700,8 @@ SIGHUP_handler(SIGNAL_ARGS)
 	PG_SETMASK(&BlockSig);
 #endif
 
-	if (Shutdown <= SmartShutdown)
+	if (StepDown == NoShutdown || StepDown == SmartShutdown ||
+		StepDown == SmartDemote)
 	{
 		ereport(LOG,
 				(errmsg("received SIGHUP, reloading configuration files")));
@@ -2769,26 +2787,81 @@ pmdie(SIGNAL_ARGS)
 			(errmsg_internal("postmaster received signal %d",
 							 postgres_signal_arg)));
 
+	if (CheckDemoteSignal())
+	{
+		if (pmState != PM_RUN)
+		{
+			DemoteSignal = false;
+			unlink(DEMOTE_SIGNAL_FILE);
+			ereport(LOG,
+					(errmsg("ignoring demote signal because already in standby mode")));
+			goto out;
+		}
+		else if (postgres_signal_arg == SIGQUIT)
+		{
+			DemoteSignal = false;
+			unlink(DEMOTE_SIGNAL_FILE);
+			ereport(WARNING,
+					(errmsg("cannot demote in immediate stop mode")));
+			goto out;
+		}
+		else
+		{
+			FILE	   *standby_file;
+
+			DemoteSignal = true;
+
+			unlink(DEMOTE_SIGNAL_FILE);
+
+			/* create the standby signal file */
+			standby_file = AllocateFile(STANDBY_SIGNAL_FILE, "w");
+			if (!standby_file)
+			{
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create file \"%s\": %m",
+								STANDBY_SIGNAL_FILE)));
+				goto out;
+			}
+
+			if (FreeFile(standby_file))
+			{
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write file \"%s\": %m",
+								STANDBY_SIGNAL_FILE)));
+				goto out;
+			}
+		}
+	}
+
 	switch (postgres_signal_arg)
 	{
 		case SIGTERM:
 
 			/*
-			 * Smart Shutdown:
+			 * Smart Stepdown:
 			 *
-			 * Wait for children to end their work, then shut down.
+			 * Wait for children to end their work, then shut down or demote.
 			 */
-			if (Shutdown >= SmartShutdown)
+			if (StepDown >= SmartShutdown)
 				break;
-			Shutdown = SmartShutdown;
-			ereport(LOG,
-					(errmsg("received smart shutdown request")));
 
-			/* Report status */
-			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
+			if (DemoteSignal) {
+				StepDown = SmartDemote;
+				ereport(LOG, (errmsg("received smart demote request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_DEMOTING);
+			}
+			else {
+				StepDown = SmartShutdown;
+				ereport(LOG, (errmsg("received smart shutdown request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
 #ifdef USE_SYSTEMD
-			sd_notify(0, "STOPPING=1");
+				sd_notify(0, "STOPPING=1");
 #endif
+			}
 
 			if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 				pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
@@ -2831,22 +2904,29 @@ pmdie(SIGNAL_ARGS)
 		case SIGINT:
 
 			/*
-			 * Fast Shutdown:
+			 * Fast Stepdown:
 			 *
 			 * Abort all children with SIGTERM (rollback active transactions
-			 * and exit) and shut down when they are gone.
+			 * and exit) and shut down or demote when they are gone.
 			 */
-			if (Shutdown >= FastShutdown)
+			if (StepDown >= FastShutdown)
 				break;
-			Shutdown = FastShutdown;
-			ereport(LOG,
-					(errmsg("received fast shutdown request")));
 
-			/* Report status */
-			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
+			if (DemoteSignal) {
+				StepDown = FastDemote;
+				ereport(LOG, (errmsg("received fast demote request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_DEMOTING);
+			}
+			else {
+				StepDown = FastShutdown;
+				ereport(LOG, (errmsg("received fast shutdown request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
 #ifdef USE_SYSTEMD
-			sd_notify(0, "STOPPING=1");
+				sd_notify(0, "STOPPING=1");
 #endif
+			}
 
 			if (StartupPID != 0)
 				signal_child(StartupPID, SIGTERM);
@@ -2903,9 +2983,9 @@ pmdie(SIGNAL_ARGS)
 			 * terminate remaining ones with SIGKILL, then exit without
 			 * attempt to properly shut down the data base system.
 			 */
-			if (Shutdown >= ImmediateShutdown)
+			if (StepDown >= ImmediateShutdown)
 				break;
-			Shutdown = ImmediateShutdown;
+			StepDown = ImmediateShutdown;
 			ereport(LOG,
 					(errmsg("received immediate shutdown request")));
 
@@ -2929,6 +3009,7 @@ pmdie(SIGNAL_ARGS)
 			break;
 	}
 
+out:
 #ifdef WIN32
 	PG_SETMASK(&UnBlockSig);
 #endif
@@ -2967,10 +3048,11 @@ reaper(SIGNAL_ARGS)
 			StartupPID = 0;
 
 			/*
-			 * Startup process exited in response to a shutdown request (or it
-			 * completed normally regardless of the shutdown request).
+			 * Startup process exited in response to a shutdown or demote
+			 * request (or it completed normally regardless of the shutdown
+			 * request).
 			 */
-			if (Shutdown > NoShutdown &&
+			if (StepDown > NoShutdown &&
 				(EXIT_STATUS_0(exitstatus) || EXIT_STATUS_1(exitstatus)))
 			{
 				StartupStatus = STARTUP_NOT_RUNNING;
@@ -2984,7 +3066,7 @@ reaper(SIGNAL_ARGS)
 				ereport(LOG,
 						(errmsg("shutdown at recovery target")));
 				StartupStatus = STARTUP_NOT_RUNNING;
-				Shutdown = SmartShutdown;
+				StepDown = SmartShutdown;
 				TerminateChildren(SIGTERM);
 				pmState = PM_WAIT_BACKENDS;
 				/* PostmasterStateMachine logic does the rest */
@@ -3124,7 +3206,7 @@ reaper(SIGNAL_ARGS)
 				 * archive cycle and quit. Likewise, if we have walsender
 				 * processes, tell them to send any remaining WAL and quit.
 				 */
-				Assert(Shutdown > NoShutdown);
+				Assert(StepDown > NoShutdown);
 
 				/* Waken archiver for the last time */
 				if (PgArchPID != 0)
@@ -3484,7 +3566,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	 * signaled children, nonzero exit status is to be expected, so don't
 	 * clutter log.
 	 */
-	take_action = !FatalError && Shutdown != ImmediateShutdown;
+	take_action = !FatalError && StepDown != ImmediateShutdown;
 
 	if (take_action)
 	{
@@ -3702,7 +3784,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 
 	/* We do NOT restart the syslogger */
 
-	if (Shutdown != ImmediateShutdown)
+	if (StepDown != ImmediateShutdown)
 		FatalError = true;
 
 	/* We now transit into a state of waiting for children to die */
@@ -3845,11 +3927,11 @@ PostmasterStateMachine(void)
 			WalReceiverPID == 0 &&
 			BgWriterPID == 0 &&
 			(CheckpointerPID == 0 ||
-			 (!FatalError && Shutdown < ImmediateShutdown)) &&
+			 (!FatalError && StepDown < ImmediateShutdown)) &&
 			WalWriterPID == 0 &&
 			AutoVacPID == 0)
 		{
-			if (Shutdown >= ImmediateShutdown || FatalError)
+			if (StepDown >= ImmediateShutdown || FatalError)
 			{
 				/*
 				 * Start waiting for dead_end children to die.  This state
@@ -3863,6 +3945,15 @@ PostmasterStateMachine(void)
 				 * FatalError state.
 				 */
 			}
+			/* Handle demote signal */
+			else if (DemoteSignal)
+			{
+				ereport(LOG, (errmsg("all backend processes terminated; demoting")));
+
+				SendProcSignal(CheckpointerPID, PROCSIG_CHECKPOINTER_DEMOTING, InvalidBackendId);
+				pmState = PM_STARTUP;
+				StepDown = NoShutdown;
+			}
 			else
 			{
 				/*
@@ -3870,7 +3961,7 @@ PostmasterStateMachine(void)
 				 * the regular children are gone, and it's time to tell the
 				 * checkpointer to do a shutdown checkpoint.
 				 */
-				Assert(Shutdown > NoShutdown);
+				Assert(StepDown > NoShutdown);
 				/* Start the checkpointer if not running */
 				if (CheckpointerPID == 0)
 					CheckpointerPID = StartCheckpointer();
@@ -3958,7 +4049,8 @@ PostmasterStateMachine(void)
 	 * EOF on its input pipe, which happens when there are no more upstream
 	 * processes.
 	 */
-	if (Shutdown > NoShutdown && pmState == PM_NO_CHILDREN)
+	if (pmState == PM_NO_CHILDREN && (StepDown == SmartShutdown ||
+		StepDown == FastShutdown || StepDown == ImmediateShutdown))
 	{
 		if (FatalError)
 		{
@@ -3991,7 +4083,7 @@ PostmasterStateMachine(void)
 	 * startup process fails, because more than likely it will just fail again
 	 * and we will keep trying forever.
 	 */
-	if (pmState == PM_NO_CHILDREN &&
+	if (pmState == PM_NO_CHILDREN && !DemoteSignal &&
 		(StartupStatus == STARTUP_CRASHED || !restart_after_crash))
 		ExitPostmaster(1);
 
@@ -5188,6 +5280,17 @@ sigusr1_handler(SIGNAL_ARGS)
 		StartWorkerNeeded = true;
 	}
 
+	/* Demoting: start the Startup Process */
+	if (CheckPostmasterSignal(PMSIGNAL_DEMOTING) &&
+		pmState == PM_STARTUP && StepDown == NoShutdown)
+	{
+		if (!XLogArchivingAlways())
+			signal_child(PgArchPID, SIGQUIT);
+		StartupPID = StartupDataBase();
+		Assert(StartupPID != 0);
+		StartupStatus = STARTUP_RUNNING;
+	}
+
 	/*
 	 * RECOVERY_STARTED and BEGIN_HOT_STANDBY signals are ignored in
 	 * unexpected states. If the startup process quickly starts up, completes
@@ -5195,7 +5298,7 @@ sigusr1_handler(SIGNAL_ARGS)
 	 * first. We don't want to go back to recovery in that case.
 	 */
 	if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED) &&
-		pmState == PM_STARTUP && Shutdown == NoShutdown)
+		pmState == PM_STARTUP && StepDown == NoShutdown)
 	{
 		/* WAL redo has started. We're out of reinitialization. */
 		FatalError = false;
@@ -5205,17 +5308,27 @@ sigusr1_handler(SIGNAL_ARGS)
 		 * Crank up the background tasks.  It doesn't matter if this fails,
 		 * we'll just try again later.
 		 */
-		Assert(CheckpointerPID == 0);
-		CheckpointerPID = StartCheckpointer();
-		Assert(BgWriterPID == 0);
-		BgWriterPID = StartBackgroundWriter();
+		if (!DemoteSignal)
+		{
+			Assert(CheckpointerPID == 0);
+			Assert(BgWriterPID == 0);
+			Assert(PgArchPID == 0);
+
+			CheckpointerPID = StartCheckpointer();
+		}
+		else
+		{
+			Assert(CheckpointerPID);
+		}
+
+		if (BgWriterPID == 0)
+			BgWriterPID = StartBackgroundWriter();
 
 		/*
 		 * Start the archiver if we're responsible for (re-)archiving received
 		 * files.
 		 */
-		Assert(PgArchPID == 0);
-		if (XLogArchivingAlways())
+		if (PgArchPID == 0 && XLogArchivingAlways())
 			PgArchPID = pgarch_start();
 
 		/*
@@ -5226,6 +5339,7 @@ sigusr1_handler(SIGNAL_ARGS)
 		if (!EnableHotStandby)
 		{
 			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STANDBY);
+			DemoteSignal = false;
 #ifdef USE_SYSTEMD
 			sd_notify(0, "READY=1");
 #endif
@@ -5234,13 +5348,15 @@ sigusr1_handler(SIGNAL_ARGS)
 		pmState = PM_RECOVERY;
 	}
 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
-		pmState == PM_RECOVERY && Shutdown == NoShutdown)
+		pmState == PM_RECOVERY && StepDown == NoShutdown)
 	{
 		/*
 		 * Likewise, start other special children as needed.
 		 */
-		Assert(PgStatPID == 0);
-		PgStatPID = pgstat_start();
+		if (!DemoteSignal)
+			Assert(PgStatPID == 0);
+		if (PgStatPID == 0)
+			PgStatPID = pgstat_start();
 
 		ereport(LOG,
 				(errmsg("database system is ready to accept read only connections")));
@@ -5252,6 +5368,7 @@ sigusr1_handler(SIGNAL_ARGS)
 #endif
 
 		pmState = PM_HOT_STANDBY;
+		DemoteSignal = false;
 		/* Some workers may be scheduled to start now */
 		StartWorkerNeeded = true;
 	}
@@ -5284,7 +5401,7 @@ sigusr1_handler(SIGNAL_ARGS)
 	}
 
 	if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER) &&
-		Shutdown == NoShutdown)
+		StepDown == NoShutdown)
 	{
 		/*
 		 * Start one iteration of the autovacuum daemon, even if autovacuuming
@@ -5299,7 +5416,7 @@ sigusr1_handler(SIGNAL_ARGS)
 	}
 
 	if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER) &&
-		Shutdown == NoShutdown)
+		StepDown == NoShutdown)
 	{
 		/* The autovacuum launcher wants us to start a worker process. */
 		StartAutovacuumWorker();
@@ -5644,7 +5761,7 @@ MaybeStartWalReceiver(void)
 	if (WalReceiverPID == 0 &&
 		(pmState == PM_STARTUP || pmState == PM_RECOVERY ||
 		 pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&
-		Shutdown == NoShutdown)
+		StepDown == NoShutdown)
 	{
 		WalReceiverPID = StartWalReceiver();
 		if (WalReceiverPID != 0)
@@ -6647,3 +6764,18 @@ InitPostmasterDeathWatchHandle(void)
 								 GetLastError())));
 #endif							/* WIN32 */
 }
+
+/*
+ * Check if a demote request appeared. Should be called by postmaster before
+ * shutting down.
+ */
+bool
+CheckDemoteSignal(void)
+{
+	struct stat stat_buf;
+
+	if (stat(DEMOTE_SIGNAL_FILE, &stat_buf) == 0)
+		return true;
+
+	return false;
+}
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4fa385b0ec..1903f4db2a 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -28,6 +28,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "postmaster/bgwriter.h"
 
 /*
  * The SIGUSR1 signal is multiplexed to support signaling multiple event
@@ -585,6 +586,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	if (CheckProcSignal(PROCSIG_CHECKPOINTER_DEMOTING))
+		ReqCheckpointDemoteHandler(PROCSIG_CHECKPOINTER_DEMOTING);
+
 	SetLatch(MyLatch);
 
 	latch_sigusr1_handler();
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df74..c144cc35d3 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -57,6 +57,8 @@ dbState(DBState state)
 			return _("shut down");
 		case DB_SHUTDOWNED_IN_RECOVERY:
 			return _("shut down in recovery");
+		case DB_DEMOTING:
+			return _("demoting");
 		case DB_SHUTDOWNING:
 			return _("shutting down");
 		case DB_IN_CRASH_RECOVERY:
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 3c03ace7ed..79bb42f7e7 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -62,6 +62,7 @@ typedef enum
 	RESTART_COMMAND,
 	RELOAD_COMMAND,
 	STATUS_COMMAND,
+	DEMOTE_COMMAND,
 	PROMOTE_COMMAND,
 	LOGROTATE_COMMAND,
 	KILL_COMMAND,
@@ -103,6 +104,7 @@ static char version_file[MAXPGPATH];
 static char pid_file[MAXPGPATH];
 static char backup_file[MAXPGPATH];
 static char promote_file[MAXPGPATH];
+static char demote_file[MAXPGPATH];
 static char logrotate_file[MAXPGPATH];
 
 static volatile pgpid_t postmasterPID = -1;
@@ -129,6 +131,7 @@ static void do_stop(void);
 static void do_restart(void);
 static void do_reload(void);
 static void do_status(void);
+static void do_demote(void);
 static void do_promote(void);
 static void do_logrotate(void);
 static void do_kill(pgpid_t pid);
@@ -1029,6 +1032,109 @@ do_stop(void)
 }
 
 
+static void
+do_demote(void)
+{
+	int			cnt;
+	FILE	   *dmtfile;
+	pgpid_t		pid;
+	struct stat statbuf;
+
+	pid = get_pgpid(false);
+
+	if (pid == 0)				/* no pid file */
+	{
+		write_stderr(_("%s: PID file \"%s\" does not exist\n"), progname, pid_file);
+		write_stderr(_("Is server running?\n"));
+		exit(1);
+	}
+	else if (pid < 0)			/* standalone backend, not postmaster */
+	{
+		pid = -pid;
+		write_stderr(_("%s: cannot demote server; "
+					   "single-user server is running (PID: %ld)\n"),
+					 progname, pid);
+		exit(1);
+	}
+	if (shutdown_mode == IMMEDIATE_MODE)
+	{
+		write_stderr(_("%s: cannot demote server using immediate mode\n"),
+					 progname);
+		exit(1);
+	}
+
+	snprintf(demote_file, MAXPGPATH, "%s/demote", pg_data);
+
+	if ((dmtfile = fopen(demote_file, "w")) == NULL)
+	{
+		write_stderr(_("%s: could not create demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+	if (fclose(dmtfile))
+	{
+		write_stderr(_("%s: could not write demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+
+	if (kill((pid_t) pid, sig) != 0)
+	{
+		write_stderr(_("%s: could not send stop signal (PID: %ld): %s\n"), progname, pid,
+					 strerror(errno));
+		exit(1);
+	}
+
+	if (!do_wait)
+	{
+		print_msg(_("server demoting\n"));
+		return;
+	}
+	else
+	{
+		/*
+		 * If backup_label exists, an online backup is running. Warn the user
+		 * that smart demote will wait for it to finish. However, if the
+		 * server is in archive recovery, we're recovering from an online
+		 * backup instead of performing one.
+		 */
+		if (shutdown_mode == SMART_MODE &&
+			stat(backup_file, &statbuf) == 0 &&
+			get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		{
+			print_msg(_("WARNING: online backup mode is active\n"
+						"Demote will not complete until pg_stop_backup() is called.\n\n"));
+		}
+
+		print_msg(_("waiting for server to demote..."));
+
+		for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
+		{
+			if (get_control_dbstate() == DB_IN_ARCHIVE_RECOVERY)
+				break;
+
+			if (cnt % WAITS_PER_SEC == 0)
+				print_msg(".");
+			pg_usleep(USEC_PER_SEC / WAITS_PER_SEC);
+		}
+
+		if (get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		{
+			print_msg(_(" failed\n"));
+
+			write_stderr(_("%s: server does not demote\n"), progname);
+			if (shutdown_mode == SMART_MODE)
+				write_stderr(_("HINT: The \"-m fast\" option immediately disconnects sessions rather than\n"
+							   "waiting for session-initiated disconnection.\n"));
+			exit(1);
+		}
+		print_msg(_(" done\n"));
+
+		print_msg(_("server demoted\n"));
+	}
+}
+
+
 /*
  *	restart/reload routines
  */
@@ -2452,6 +2558,8 @@ main(int argc, char **argv)
 				ctl_command = RELOAD_COMMAND;
 			else if (strcmp(argv[optind], "status") == 0)
 				ctl_command = STATUS_COMMAND;
+			else if (strcmp(argv[optind], "demote") == 0)
+				ctl_command = DEMOTE_COMMAND;
 			else if (strcmp(argv[optind], "promote") == 0)
 				ctl_command = PROMOTE_COMMAND;
 			else if (strcmp(argv[optind], "logrotate") == 0)
@@ -2559,6 +2667,9 @@ main(int argc, char **argv)
 		case RELOAD_COMMAND:
 			do_reload();
 			break;
+		case DEMOTE_COMMAND:
+			do_demote();
+			break;
 		case PROMOTE_COMMAND:
 			do_promote();
 			break;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77ac4e785f..ff0119046e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -219,18 +219,20 @@ extern bool XLOG_DEBUG;
 
 /* These directly affect the behavior of CreateCheckPoint and subsidiaries */
 #define CHECKPOINT_IS_SHUTDOWN	0x0001	/* Checkpoint is for shutdown */
-#define CHECKPOINT_END_OF_RECOVERY	0x0002	/* Like shutdown checkpoint, but
+#define CHECKPOINT_IS_DEMOTE	0x0002	/* Like shutdown checkpoint, but
+											 * issued at end of WAL production */
+#define CHECKPOINT_END_OF_RECOVERY	0x0004	/* Like shutdown checkpoint, but
 											 * issued at end of WAL recovery */
-#define CHECKPOINT_IMMEDIATE	0x0004	/* Do it without delays */
-#define CHECKPOINT_FORCE		0x0008	/* Force even if no activity */
-#define CHECKPOINT_FLUSH_ALL	0x0010	/* Flush all pages, including those
+#define CHECKPOINT_IMMEDIATE	0x0008	/* Do it without delays */
+#define CHECKPOINT_FORCE		0x0010	/* Force even if no activity */
+#define CHECKPOINT_FLUSH_ALL	0x0020	/* Flush all pages, including those
 										 * belonging to unlogged tables */
 /* These are important to RequestCheckpoint */
-#define CHECKPOINT_WAIT			0x0020	/* Wait for completion */
-#define CHECKPOINT_REQUESTED	0x0040	/* Checkpoint request has been made */
+#define CHECKPOINT_WAIT			0x0040	/* Wait for completion */
+#define CHECKPOINT_REQUESTED	0x0080	/* Checkpoint request has been made */
 /* These indicate the cause of a checkpoint request */
-#define CHECKPOINT_CAUSE_XLOG	0x0080	/* XLOG consumption */
-#define CHECKPOINT_CAUSE_TIME	0x0100	/* Elapsed time */
+#define CHECKPOINT_CAUSE_XLOG	0x0100	/* XLOG consumption */
+#define CHECKPOINT_CAUSE_TIME	0x0200	/* Elapsed time */
 
 /*
  * Flag bits for the record being inserted, set using XLogSetRecordFlags().
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..b38671ae52 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -76,6 +76,7 @@ typedef struct CheckPoint
 #define XLOG_END_OF_RECOVERY			0x90
 #define XLOG_FPI_FOR_HINT				0xA0
 #define XLOG_FPI						0xB0
+#define XLOG_CHECKPOINT_DEMOTE			0xC0
 
 
 /*
@@ -87,6 +88,7 @@ typedef enum DBState
 	DB_STARTUP = 0,
 	DB_SHUTDOWNED,
 	DB_SHUTDOWNED_IN_RECOVERY,
+	DB_DEMOTING,
 	DB_SHUTDOWNING,
 	DB_IN_CRASH_RECOVERY,
 	DB_IN_ARCHIVE_RECOVERY,
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 179ebaa104..a9e27f009e 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -70,7 +70,12 @@ typedef struct
 
 typedef enum CAC_state
 {
-	CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
+	CAC_OK,
+	CAC_STARTUP,
+	CAC_DEMOTE,
+	CAC_SHUTDOWN,
+	CAC_RECOVERY,
+	CAC_TOOMANY,
 	CAC_WAITBACKUP
 } CAC_state;
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e..4d4f0ea1dd 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -41,5 +41,6 @@ extern Size CheckpointerShmemSize(void);
 extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
+extern void ReqCheckpointDemoteHandler(SIGNAL_ARGS);
 
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..1c5baf3b68 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -41,6 +41,7 @@ typedef enum
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
+	PMSIGNAL_DEMOTING, /* restart startup process */
 
 	NUM_PMSIGNALS				/* Must be last value of enum! */
 } PMSignalReason;
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f3..eb0bda04f5 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -34,6 +34,7 @@ typedef enum
 	PROCSIG_PARALLEL_MESSAGE,	/* message from cooperating parallel backend */
 	PROCSIG_WALSND_INIT_STOPPING,	/* ask walsenders to prepare for shutdown  */
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
+	PROCSIG_CHECKPOINTER_DEMOTING,	/* ask checkpointer to demote */
 
 	/* Recovery conflict reasons */
 	PROCSIG_RECOVERY_CONFLICT_DATABASE,
diff --git a/src/include/utils/pidfile.h b/src/include/utils/pidfile.h
index 63fefe5c4c..f761d2c4ef 100644
--- a/src/include/utils/pidfile.h
+++ b/src/include/utils/pidfile.h
@@ -50,6 +50,7 @@
  */
 #define PM_STATUS_STARTING		"starting"	/* still starting up */
 #define PM_STATUS_STOPPING		"stopping"	/* in shutdown sequence */
+#define PM_STATUS_DEMOTING		"demoting"	/* demote sequence */
 #define PM_STATUS_READY			"ready   "	/* ready for connections */
 #define PM_STATUS_STANDBY		"standby "	/* up, won't accept connections */
 
-- 
2.20.1

#20 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#19)
1 attachment(s)
Re: [patch] demote

Hello.

At Thu, 25 Jun 2020 19:27:54 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

Here is a summary of my work during the last few days on this demote approach.

Please find attached v2-0001-Demote-PoC.patch; comments are in the commit
message and marked as FIXME in the code.

The patch is not finished or bug-free yet. I'm still not very happy with the
coding style, and it probably lacks some code documentation, but a lot has
changed since v1. It's still a PoC to push the discussion a bit further after
staying silent myself for some days.

The patch currently relies on a demote checkpoint. I understand that the
overhead of a forced checkpoint can be massive and cause a major wait or
downtime, but I'm keeping this for a later step. Maybe we should be able to
cancel a running checkpoint? Or leave it to its syncing work but discard the
result without writing it to XLog?

If we are going to dive so close to server shutdown, we can just
utilize the restart-after-crash path, which we can assume to work
reliably. The attached is a quite rough sketch of that, hijacking the
smart shutdown path for convenience, but it seems to work. "pg_ctl
-m s -W stop" makes the server demote.
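
As a rough illustration of the idea above (not code from either patch), the restart path can be modelled as a tiny state machine: a demote request is handled like a fast shutdown, but once no children remain the postmaster relaunches the startup process instead of exiting. Every Toy*/toy_* name below is hypothetical and only loosely mirrors the postmaster's pmState handling.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: a heavily reduced stand-in for the postmaster states. */
typedef enum
{
	TOY_PM_RUN,
	TOY_PM_WAIT_BACKENDS,
	TOY_PM_NO_CHILDREN,
	TOY_PM_STARTUP
} ToyPmState;

typedef struct
{
	ToyPmState	state;
	bool		demoting;		/* restart as standby instead of exiting */
	bool		exited;			/* the postmaster would have exited */
	int			children;		/* number of live child processes */
} ToyPostmaster;

/* A demote request is a stop request that also sets the demoting flag. */
void
toy_request_stop(ToyPostmaster *pm, bool demote)
{
	assert(pm->state == TOY_PM_RUN);
	pm->demoting = demote;
	pm->state = (pm->children > 0) ? TOY_PM_WAIT_BACKENDS : TOY_PM_NO_CHILDREN;
}

/* Bookkeeping done each time a child process goes away. */
void
toy_child_exited(ToyPostmaster *pm)
{
	assert(pm->children > 0);
	pm->children--;
	if (pm->children == 0 && pm->state == TOY_PM_WAIT_BACKENDS)
		pm->state = TOY_PM_NO_CHILDREN;
}

/* Once nobody is left: exit on shutdown, restart recovery on demote. */
void
toy_state_machine(ToyPostmaster *pm)
{
	if (pm->state != TOY_PM_NO_CHILDREN)
		return;

	if (pm->demoting)
	{
		pm->demoting = false;
		pm->state = TOY_PM_STARTUP;	/* startup process runs again, as standby */
	}
	else
		pm->exited = true;
}
```

The only difference between the two outcomes is the flag checked when the last child is gone, which is why this approach needs so little new code, and also why it cannot keep any session alive.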

I haven't had time to investigate Robert's concern about shared memory for
snapshots during recovery.

The patch does all required cleanup of resources including shared
memory, I believe. Is that enough if we don't need to keep any resources
alive?

The patch doesn't deal with prepared xacts yet. Testing "start->demote->promote"
raises an assert if some prepared xacts exist. I suppose I will roll them back
during demote in the next patch version.

I'm not sure how to divide this patch into multiple small independent steps. I
suppose I can split it like:

1. add demote checkpoint
2. support demote: mostly postmaster, startup/xlog and checkpointer related
code
3. cli using pg_ctl demote

...But I'm not sure it's worth it.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

simple-demote.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b4d475bb0b..a4adf3e587 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2752,6 +2752,7 @@ SIGHUP_handler(SIGNAL_ARGS)
 /*
  * pmdie -- signal handler for processing various postmaster signals.
  */
+static bool		demoting = false;
 static void
 pmdie(SIGNAL_ARGS)
 {
@@ -2774,59 +2775,17 @@ pmdie(SIGNAL_ARGS)
 		case SIGTERM:
 
 			/*
-			 * Smart Shutdown:
+			 * XXX: Hijacked as DEMOTE
 			 *
-			 * Wait for children to end their work, then shut down.
+			 * Runs fast shutdown, then restarts as standby
 			 */
 			if (Shutdown >= SmartShutdown)
 				break;
 			Shutdown = SmartShutdown;
 			ereport(LOG,
-					(errmsg("received smart shutdown request")));
-
-			/* Report status */
-			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
-#ifdef USE_SYSTEMD
-			sd_notify(0, "STOPPING=1");
-#endif
-
-			if (pmState == PM_RUN || pmState == PM_RECOVERY ||
-				pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
-			{
-				/* autovac workers are told to shut down immediately */
-				/* and bgworkers too; does this need tweaking? */
-				SignalSomeChildren(SIGTERM,
-								   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
-				/* and the autovac launcher too */
-				if (AutoVacPID != 0)
-					signal_child(AutoVacPID, SIGTERM);
-				/* and the bgwriter too */
-				if (BgWriterPID != 0)
-					signal_child(BgWriterPID, SIGTERM);
-				/* and the walwriter too */
-				if (WalWriterPID != 0)
-					signal_child(WalWriterPID, SIGTERM);
-
-				/*
-				 * If we're in recovery, we can't kill the startup process
-				 * right away, because at present doing so does not release
-				 * its locks.  We might want to change this in a future
-				 * release.  For the time being, the PM_WAIT_READONLY state
-				 * indicates that we're waiting for the regular (read only)
-				 * backends to die off; once they do, we'll kill the startup
-				 * and walreceiver processes.
-				 */
-				pmState = (pmState == PM_RUN) ?
-					PM_WAIT_BACKUP : PM_WAIT_READONLY;
-			}
-
-			/*
-			 * Now wait for online backup mode to end and backends to exit. If
-			 * that is already the case, PostmasterStateMachine will take the
-			 * next step.
-			 */
-			PostmasterStateMachine();
-			break;
+					(errmsg("received demote request")));
+			demoting = true;
+			/* FALL THROUGH */
 
 		case SIGINT:
 
@@ -2839,8 +2798,10 @@ pmdie(SIGNAL_ARGS)
 			if (Shutdown >= FastShutdown)
 				break;
 			Shutdown = FastShutdown;
-			ereport(LOG,
-					(errmsg("received fast shutdown request")));
+
+			if (!demoting)
+				ereport(LOG,
+						(errmsg("received fast shutdown request")));
 
 			/* Report status */
 			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
@@ -2887,6 +2848,13 @@ pmdie(SIGNAL_ARGS)
 				pmState = PM_WAIT_BACKENDS;
 			}
 
+			/* create standby signal file */
+			{
+				FILE *standby_file = AllocateFile(STANDBY_SIGNAL_FILE, "w");
+				if (standby_file == NULL || FreeFile(standby_file) != 0)
+					elog(LOG, "could not create standby signal file");
+			}
+
 			/*
 			 * Now wait for backends to exit.  If there are none,
 			 * PostmasterStateMachine will take the next step.
@@ -3958,7 +3926,7 @@ PostmasterStateMachine(void)
 	 * EOF on its input pipe, which happens when there are no more upstream
 	 * processes.
 	 */
-	if (Shutdown > NoShutdown && pmState == PM_NO_CHILDREN)
+	if (!demoting && Shutdown > NoShutdown && pmState == PM_NO_CHILDREN)
 	{
 		if (FatalError)
 		{
@@ -3996,13 +3964,23 @@ PostmasterStateMachine(void)
 		ExitPostmaster(1);
 
 	/*
-	 * If we need to recover from a crash, wait for all non-syslogger children
-	 * to exit, then reset shmem and StartupDataBase.
+	 * If we need to recover from a crash or demoting, wait for all
+	 * non-syslogger children to exit, then reset shmem and StartupDataBase.
 	 */
-	if (FatalError && pmState == PM_NO_CHILDREN)
+	if ((demoting || FatalError) && pmState == PM_NO_CHILDREN)
 	{
-		ereport(LOG,
-				(errmsg("all server processes terminated; reinitializing")));
+		if (demoting)
+			ereport(LOG,
+					(errmsg("all server processes terminated; starting as standby")));
+		else
+			ereport(LOG,
+					(errmsg("all server processes terminated; reinitializing")));
+
+		if (demoting)
+		{
+			Shutdown = NoShutdown;
+			demoting = false;
+		}
 
 		/* allow background workers to immediately restart */
 		ResetBackgroundWorkerCrashTimes();
#21 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#20)
Re: [patch] demote

Mmm. Fat finger..

At Fri, 26 Jun 2020 16:14:38 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Hello.

If we are going to dive so close to server shutdown, we can just
utilize the restart-after-crash path, which we can assume to work
reliably. The attached is a quite rough sketch of that, hijacking the
smart shutdown path for convenience, but it seems to work. "pg_ctl
-m s -W stop" makes the server demote.

I haven't had time to investigate Robert's concern about shared memory for
snapshots during recovery.

The patch does all required cleanup of resources including shared

The path does all required cleanup of..

memory, I believe. Is that enough if we don't need to keep any resources
alive?

The patch doesn't deal with prepared xacts yet. Testing "start->demote->promote"
raises an assert if some prepared xacts exist. I suppose I will roll them back
during demote in the next patch version.

I'm not sure how to divide this patch into multiple small independent steps. I
suppose I can split it like:

1. add demote checkpoint
2. support demote: mostly postmaster, startup/xlog and checkpointer related
code
3. cli using pg_ctl demote

...But I'm not sure it's worth it.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In reply to: Kyotaro Horiguchi (#20)
Re: [patch] demote

On Fri, 26 Jun 2020 16:14:38 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

Hello.

At Thu, 25 Jun 2020 19:27:54 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Here is a summary of my work during the last few days on this demote
approach.

Please find attached v2-0001-Demote-PoC.patch; comments are in the commit
message and marked as FIXME in the code.

The patch is not finished or bug-free yet. I'm still not very happy with the
coding style, and it probably lacks some code documentation, but a lot has
changed since v1. It's still a PoC to push the discussion a bit further
after staying silent myself for some days.

The patch currently relies on a demote checkpoint. I understand that the
overhead of a forced checkpoint can be massive and cause a major wait or
downtime, but I'm keeping this for a later step. Maybe we should be able to
cancel a running checkpoint? Or leave it to its syncing work but discard the
result without writing it to XLog?

If we are going to dive so close to server shutdown, we can just
utilize the restart-after-crash path, which we can assume to work
reliably. The attached is a quite rough sketch of that, hijacking the
smart shutdown path for convenience, but it seems to work. "pg_ctl
-m s -W stop" makes the server demote.

This was actually my very first toy PoC.

However, resetting everything is far from the graceful demote I was looking
for. Moreover, such a patch will not be able to evolve to, e.g., keep read-only
backends around.

I haven't had time to investigate Robert's concern about shared memory for
snapshots during recovery.

The patch does all required cleanup of resources including shared
memory, I believe. Is that enough if we don't need to keep any resources
alive?

Resetting everything might not be enough. If I understand Robert's concern
correctly, a standby might actually need more shmem for hot standby xact
snapshots, or some shmem initialized differently.

Regards,

In reply to: Jehan-Guillaume de Rorthais (#19)
Re: [patch] demote

Hi,

Here is a small activity summary since last report.

On Thu, 25 Jun 2020 19:27:54 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
[...]

I haven't had time to investigate Robert's concern about shared memory for
snapshots during recovery.

I haven't had time to dig very far, but I suppose this might be related to the
comment in ProcArrayShmemSize(). If I'm right, then it seems the space is
already allocated as long as hot_standby is enabled. I realize that doesn't
mean we are on the safe side of the fence, though. I still need a better
understanding of this.
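
To make the point about ProcArrayShmemSize() concrete, here is a simplified model of the sizing rule being referred to: extra room for tracking the primary's transactions is reserved only when hot standby is enabled, whatever role the instance starts in. This is a sketch under assumptions, not the real function: toy_known_assigned_xids_size() is a hypothetical name, and the per-backend cache size of 64 subxids is assumed rather than taken from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed size of the per-backend subtransaction cache. */
#define TOY_MAX_CACHED_SUBXIDS 64

/*
 * Extra shared memory (in bytes) reserved for tracking transactions that are
 * running on the primary. In this toy model it depends only on the
 * hot_standby setting and the number of proc slots, not on whether the
 * instance is currently a primary or a standby.
 */
size_t
toy_known_assigned_xids_size(bool hot_standby, size_t max_procs)
{
	size_t		slots;

	assert(max_procs > 0);

	if (!hot_standby)
		return 0;				/* no snapshots can be taken during recovery */

	slots = (TOY_MAX_CACHED_SUBXIDS + 1) * max_procs;

	/* one xid plus one validity flag per slot */
	return slots * (sizeof(uint32_t) + sizeof(bool));
}
```

If the real sizing follows this shape, a primary started with hot_standby enabled already owns the space a demoted standby would need; whether that space is also initialized correctly after a demote is the part that remains open.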

The patch doesn't deal with prepared xacts yet. Testing
"start->demote->promote" raises an assert if some prepared xacts exist. I
suppose I will roll them back during demote in the next patch version.

Rolling back all prepared transactions on demote seems easy. However, I
realized there's no point in cancelling them: after the demote action, they
might still be committed later on a promoted instance.

I am currently trying to clean up shared memory for existing prepared
transactions so they are handled by the startup process during recovery.
I've been able to clean TwoPhaseState and the ProcArray. I'm now in the
process of cleaning the remaining prepared xact locks.
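
The cleanup described above can be pictured with a small self-contained model: every in-memory entry for a prepared transaction is handed back to a freelist, while the on-disk state is left alone so that recovery can load it again. This is only a sketch; the Toy* types and toy_shutdown_prepared() are hypothetical and much simpler than the real TwoPhaseState handling.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PREPARED 4

typedef struct ToyGXact
{
	unsigned int xid;
	struct ToyGXact *next;		/* freelist link */
} ToyGXact;

typedef struct
{
	ToyGXact   *freeGXacts;		/* head of the list of unused entries */
	int			numPrepXacts;	/* number of valid entries in prepXacts[] */
	ToyGXact   *prepXacts[TOY_MAX_PREPARED];
} ToyTwoPhaseState;

/*
 * Forget every in-memory prepared transaction and return how many entries
 * were released. Nothing is rolled back: the transactions stay prepared on
 * disk and can be reloaded later.
 */
int
toy_shutdown_prepared(ToyTwoPhaseState *state)
{
	int			released = 0;

	while (state->numPrepXacts > 0)
	{
		ToyGXact   *gxact = state->prepXacts[--state->numPrepXacts];

		/* a real implementation would also release the xact's locks here */
		gxact->next = state->freeGXacts;
		state->freeGXacts = gxact;
		state->prepXacts[state->numPrepXacts] = NULL;
		released++;
	}

	assert(state->numPrepXacts == 0);
	return released;
}
```

Walking the array from the end keeps the loop trivial and leaves the counter at zero when it finishes, which is the invariant the startup process needs before it rebuilds the entries in standby mode.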

Regards,

In reply to: Jehan-Guillaume de Rorthais (#23)
2 attachment(s)
Re: [patch] demote

Hi,

Another summary + patch + tests.

This patch supports 2PC. The goal is to keep prepared transactions safe during
demote/promote actions so they can be committed or rolled back later on a
primary. See tests.

The checkpointer is now shut down after the demote shutdown checkpoint. This
removes some useless code complexity, e.g. the checkpointer no longer has to
signal the postmaster to keep going with the demotion.

Cascaded replication is now supported. WAL senders stay active during
demotion but set their local "am_cascading_walsender = true". It has been a
rough debug session (thank you rr and tests!) on my side, but it might be worth
it. I believe they should stay connected during the demote actions for future
features, e.g. triggering a switchover over the replication protocol using an
admin function.

The first tests have been added in "recovery/t/021_promote-demote.pl". I'll add
some more tests in future versions.

I believe the patch is ready for some preliminary tests and advice or
directions.

On my todo:

* study how to only disconnect or cancel active RW backends
* ...then add pg_demote() admin function
* cancel running checkpoint for fast demote ?
* user documentation
* Robert's concern about snapshot during hot standby
* some more coding style cleanup/refactoring
* anything else reported to me :)

Thanks,

On Fri, 3 Jul 2020 00:12:10 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

Hi,

Here is a small activity summary since last report.

On Thu, 25 Jun 2020 19:27:54 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
[...]

I haven't had time to investigate Robert's concern about shared memory for
snapshots during recovery.

I haven't had time to dig very far, but I suppose this might be related to the
comment in ProcArrayShmemSize(). If I'm right, then it seems the space is
already allocated as long as hot_standby is enabled. I realize that doesn't
mean we are on the safe side of the fence, though. I still need a better
understanding of this.

The patch doesn't deal with prepared xacts yet. Testing
"start->demote->promote" raises an assert if some prepared xacts exist. I
suppose I will roll them back during demote in the next patch version.

Rolling back all prepared transactions on demote seems easy. However, I
realized there's no point in cancelling them: after the demote action, they
might still be committed later on a promoted instance.

I am currently trying to clean up shared memory for existing prepared
transactions so they are handled by the startup process during recovery.
I've been able to clean TwoPhaseState and the ProcArray. I'm now in the
process of cleaning the remaining prepared xact locks.

Regards,

--
Jehan-Guillaume de Rorthais
Dalibo

Attachments:

v3-0001-Support-demoting-instance-from-production-to-standby.patch (text/x-patch)
From 4470772702273c720cdea942ed229d59f3a70318 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Fri, 10 Apr 2020 18:01:45 +0200
Subject: [PATCH 1/2] Support demoting instance from production to standby

Architecture:

* creates a shutdown checkpoint on demote
* use DB_DEMOTING state in controlfile
* try to handle subsystems init correctly during demote
* keep some sub-processes alive:
  stat collector, bgwriter and optionally archiver or wal senders
* the code currently uses USR1 to signal the wal senders to check
  if they must enable the cascading mode
* ShutdownXLOG takes a boolean arg to handle demote differently
* the checkpointer is restarted for code simplicity

Trivial manual tests:

* make check: OK
* make check-world: OK
* start in production->demote->demote: OK
* start in production->demote->stop: OK
* start in production->demote->promote: OK
* switch roles between primary and standby (switchover): OK
* commit and check 2PC after demote/promote
* commit and check 2PC after switchover

Discuss/Todo:

* cancel or kill active and idle in xact RW backends
  * keep RO backends
  * pg_demote() function?
* code reviewing, arch, analysis, checks, etc
* investigate snapshots shmem needs/init during recovery compared to
  production
* add doc
* cancel running checkpoint during demote
  * replace with a END_OF_PRODUCTION xlog record?
---
 src/backend/access/transam/twophase.c   |  95 +++++++
 src/backend/access/transam/xlog.c       | 315 ++++++++++++++++--------
 src/backend/postmaster/checkpointer.c   |  28 +++
 src/backend/postmaster/postmaster.c     | 261 +++++++++++++++-----
 src/backend/replication/walsender.c     |   1 +
 src/backend/storage/ipc/procarray.c     |   2 +
 src/backend/storage/ipc/procsignal.c    |   4 +
 src/backend/storage/lmgr/lock.c         |  12 +
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/bin/pg_ctl/pg_ctl.c                 | 111 +++++++++
 src/include/access/twophase.h           |   1 +
 src/include/access/xlog.h               |  19 +-
 src/include/catalog/pg_control.h        |   1 +
 src/include/libpq/libpq-be.h            |   7 +-
 src/include/postmaster/bgwriter.h       |   1 +
 src/include/storage/lock.h              |   2 +
 src/include/storage/procsignal.h        |   1 +
 src/include/utils/pidfile.h             |   1 +
 18 files changed, 689 insertions(+), 175 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9b2e59bf0e..fda085631f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1565,6 +1565,101 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	pfree(buf);
 }
 
+/*
+ * ShutdownPreparedTransactions: clean prepared xacts from shared memory
+ *
+ * This is called during the demote process to clean the shared memory
+ * before the startup process loads everything back in correctly
+ * for the standby mode.
+ *
+ * Note: this function assumes all prepared transactions have been
+ * written to disk. Consequently, it must be called AFTER the demote
+ * shutdown checkpoint.
+ */
+void
+ShutdownPreparedTransactions(void)
+{
+	int i;
+
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact;
+		PGPROC	   *proc;
+		TransactionId xid;
+		char	   *buf;
+		char	   *bufptr;
+		TwoPhaseFileHeader *hdr;
+		TransactionId latestXid;
+		TransactionId *children;
+
+		gxact = TwoPhaseState->prepXacts[i];
+		proc = &ProcGlobal->allProcs[gxact->pgprocno];
+		xid = ProcGlobal->allPgXact[gxact->pgprocno].xid;
+
+		/* Read and validate 2PC state data */
+		Assert(gxact->ondisk);
+		buf = ReadTwoPhaseFile(xid, false);
+
+		/*
+		 * Disassemble the header area
+		 */
+		hdr = (TwoPhaseFileHeader *) buf;
+		Assert(TransactionIdEquals(hdr->xid, xid));
+		bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader))
+			   + MAXALIGN(hdr->gidlen);
+		children = (TransactionId *) bufptr;
+		bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId))
+				+ MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode))
+				+ MAXALIGN(hdr->nabortrels * sizeof(RelFileNode))
+				+ MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+
+		/* compute latestXid among all children */
+		latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children);
+
+		/* remove dummy proc associated with the gxact */
+		ProcArrayRemove(proc, latestXid);
+
+		/*
+		 * This lock is probably not needed during the demote process
+		 * as all backends are already gone.
+		 */
+		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+		/* cleanup locks */
+		for (;;)
+		{
+			TwoPhaseRecordOnDisk *record = (TwoPhaseRecordOnDisk *) bufptr;
+
+			Assert(record->rmid <= TWOPHASE_RM_MAX_ID);
+			if (record->rmid == TWOPHASE_RM_END_ID)
+				break;
+
+			bufptr += MAXALIGN(sizeof(TwoPhaseRecordOnDisk));
+
+			if (record->rmid == TWOPHASE_RM_LOCK_ID)
+				lock_twophase_shutdown(xid, record->info,
+									 (void *) bufptr, record->len);
+
+			bufptr += MAXALIGN(record->len);
+		}
+
+		/* and put it back in the freelist */
+		gxact->next = TwoPhaseState->freeGXacts;
+		TwoPhaseState->freeGXacts = gxact;
+
+		/*
+		 * Release the lock as all callbacks are called and shared memory cleanup
+		 * is done.
+		 */
+		LWLockRelease(TwoPhaseStateLock);
+
+		pfree(buf);
+	}
+
+	TwoPhaseState->numPrepXacts -= i;
+	Assert(TwoPhaseState->numPrepXacts == 0);
+}
+
 /*
  * Scan 2PC state data in memory and call the indicated callbacks for each 2PC record.
  */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 28daf72a50..3a52f7fde8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6301,6 +6301,11 @@ CheckRequiredParameterValues(void)
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
+/*
+ * FIXME demote: part of the code here assumes there are no other active
+ * processes before signal PMSIGNAL_RECOVERY_STARTED is sent.
+ */
+
 void
 StartupXLOG(void)
 {
@@ -6324,6 +6329,7 @@ StartupXLOG(void)
 	XLogPageReadPrivate private;
 	bool		fast_promoted = false;
 	struct stat st;
+	bool		is_demoting = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6388,6 +6394,25 @@ StartupXLOG(void)
 							str_time(ControlFile->time))));
 			break;
 
+		case DB_DEMOTING:
+			ereport(LOG,
+					(errmsg("database system was demoted at %s",
+							str_time(ControlFile->time))));
+			is_demoting = true;
+			bgwriterLaunched = true;
+			InArchiveRecovery = true;
+			StandbyMode = true;
+
+			/*
+			 * Previous state was RECOVERY_STATE_DONE. We need to
+			 * reinit it to something else so RecoveryInProgress()
+			 * doesn't return false.
+			 */
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+			break;
+
 		default:
 			ereport(FATAL,
 					(errmsg("control file contains invalid database cluster state")));
@@ -6421,7 +6446,8 @@ StartupXLOG(void)
 	 *   persisted.  To avoid that, fsync the entire data directory.
 	 */
 	if (ControlFile->state != DB_SHUTDOWNED &&
-		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
+		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY &&
+		ControlFile->state != DB_DEMOTING)
 	{
 		RemoveTempXlogFiles();
 		SyncDataDirectory();
@@ -6677,7 +6703,8 @@ StartupXLOG(void)
 					(errmsg("could not locate a valid checkpoint record")));
 		}
 		memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
-		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
+		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN) &&
+			!is_demoting;
 	}
 
 	/*
@@ -6739,9 +6766,9 @@ StartupXLOG(void)
 	LastRec = RecPtr = checkPointLoc;
 
 	ereport(DEBUG1,
-			(errmsg_internal("redo record is at %X/%X; shutdown %s",
+			(errmsg_internal("redo record is at %X/%X; %s checkpoint",
 							 (uint32) (checkPoint.redo >> 32), (uint32) checkPoint.redo,
-							 wasShutdown ? "true" : "false")));
+							 wasShutdown ? "shutdown" : is_demoting ? "demote" : "")));
 	ereport(DEBUG1,
 			(errmsg_internal("next transaction ID: " UINT64_FORMAT "; next OID: %u",
 							 U64FromFullTransactionId(checkPoint.nextFullXid),
@@ -6775,47 +6802,7 @@ StartupXLOG(void)
 					 checkPoint.newestCommitTsXid);
 	XLogCtl->ckptFullXid = checkPoint.nextFullXid;
 
-	/*
-	 * Initialize replication slots, before there's a chance to remove
-	 * required resources.
-	 */
-	StartupReplicationSlots();
-
-	/*
-	 * Startup logical state, needs to be setup now so we have proper data
-	 * during crash recovery.
-	 */
-	StartupReorderBuffer();
-
-	/*
-	 * Startup MultiXact. We need to do this early to be able to replay
-	 * truncations.
-	 */
-	StartupMultiXact();
-
-	/*
-	 * Ditto for commit timestamps.  Activate the facility if the setting is
-	 * enabled in the control file, as there should be no tracking of commit
-	 * timestamps done when the setting was disabled.  This facility can be
-	 * started or stopped when replaying a XLOG_PARAMETER_CHANGE record.
-	 */
-	if (ControlFile->track_commit_timestamp)
-		StartupCommitTs();
-
-	/*
-	 * Recover knowledge about replay progress of known replication partners.
-	 */
-	StartupReplicationOrigin();
 
-	/*
-	 * Initialize unlogged LSN. On a clean shutdown, it's restored from the
-	 * control file. On recovery, all unlogged relations are blown away, so
-	 * the unlogged LSN counter can be reset too.
-	 */
-	if (ControlFile->state == DB_SHUTDOWNED)
-		XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
-	else
-		XLogCtl->unloggedLSN = FirstNormalUnloggedLSN;
 
 	/*
 	 * We must replay WAL entries using the same TimeLineID they were created
@@ -6824,19 +6811,64 @@ StartupXLOG(void)
 	 */
 	ThisTimeLineID = checkPoint.ThisTimeLineID;
 
-	/*
-	 * Copy any missing timeline history files between 'now' and the recovery
-	 * target timeline from archive to pg_wal. While we don't need those files
-	 * ourselves - the history file of the recovery target timeline covers all
-	 * the previous timelines in the history too - a cascading standby server
-	 * might be interested in them. Or, if you archive the WAL from this
-	 * server to a different archive than the primary, it'd be good for all the
-	 * history files to get archived there after failover, so that you can use
-	 * one of the old timelines as a PITR target. Timeline history files are
-	 * small, so it's better to copy them unnecessarily than not copy them and
-	 * regret later.
-	 */
-	restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI);
+	if (!is_demoting)
+	{
+		/*
+		 * Initialize replication slots, before there's a chance to remove
+		 * required resources.
+		 */
+		StartupReplicationSlots();
+
+		/*
+		 * Startup logical state, needs to be setup now so we have proper data
+		 * during crash recovery.
+		 */
+		StartupReorderBuffer();
+
+		/*
+		 * Startup MultiXact. We need to do this early to be able to replay
+		 * truncations.
+		 */
+		StartupMultiXact();
+
+		/*
+		 * Ditto for commit timestamps.  Activate the facility if the setting is
+		 * enabled in the control file, as there should be no tracking of commit
+		 * timestamps done when the setting was disabled.  This facility can be
+		 * started or stopped when replaying a XLOG_PARAMETER_CHANGE record.
+		 */
+		if (ControlFile->track_commit_timestamp)
+			StartupCommitTs();
+
+		/*
+		 * Recover knowledge about replay progress of known replication partners.
+		 */
+		StartupReplicationOrigin();
+
+		/*
+		 * Initialize unlogged LSN. On a clean shutdown, it's restored from the
+		 * control file. On recovery, all unlogged relations are blown away, so
+		 * the unlogged LSN counter can be reset too.
+		 */
+		if (ControlFile->state == DB_SHUTDOWNED)
+			XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
+		else
+			XLogCtl->unloggedLSN = FirstNormalUnloggedLSN;
+
+		/*
+		 * Copy any missing timeline history files between 'now' and the recovery
+		 * target timeline from archive to pg_wal. While we don't need those files
+		 * ourselves - the history file of the recovery target timeline covers all
+		 * the previous timelines in the history too - a cascading standby server
+		 * might be interested in them. Or, if you archive the WAL from this
+		 * server to a different archive than the primary, it'd be good for all the
+		 * history files to get archived there after failover, so that you can use
+		 * one of the old timelines as a PITR target. Timeline history files are
+		 * small, so it's better to copy them unnecessarily than not copy them and
+		 * regret later.
+		 */
+		restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI);
+	}
 
 	/*
 	 * Before running in recovery, scan pg_twophase and fill in its status to
@@ -6891,11 +6923,25 @@ StartupXLOG(void)
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
 		{
-			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+			if (is_demoting)
+			{
+				/*
+				 * Avoid concurrent access to the ControlFile data
+				 * during demotion.
+				 */
+				LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+				ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+				LWLockRelease(ControlFileLock);
+			}
+			else
+			{
+				ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
 
-			SpinLockAcquire(&XLogCtl->info_lck);
-			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
-			SpinLockRelease(&XLogCtl->info_lck);
+				/* This is already set if demoting */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+			}
 		}
 		else
 		{
@@ -6985,7 +7031,8 @@ StartupXLOG(void)
 		/*
 		 * Reset pgstat data, because it may be invalid after recovery.
 		 */
-		pgstat_reset_all();
+		if (!is_demoting)
+			pgstat_reset_all();
 
 		/*
 		 * If there was a backup label file, it's done its job and the info
@@ -7047,7 +7094,7 @@ StartupXLOG(void)
 
 			InitRecoveryTransactionEnvironment();
 
-			if (wasShutdown)
+			if (wasShutdown || is_demoting)
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
@@ -7060,6 +7107,11 @@ StartupXLOG(void)
 			 * Startup commit log and subtrans only.  MultiXact and commit
 			 * timestamp have already been started up and other SLRUs are not
 			 * maintained during recovery and need not be started yet.
+			 *
+			 * Starting up commit log is technically not needed during demote
+			 * as the in-memory data did not move. However, this is a
+			 * lightweight initialization, and ShutdownCLOG() is expected to
+			 * have been called during ShutdownXLOG().
 			 */
 			StartupCLOG();
 			StartupSUBTRANS(oldestActiveXID);
@@ -7070,7 +7122,7 @@ StartupXLOG(void)
 			 * empty running-xacts record and use that here and now. Recover
 			 * additional standby state for prepared transactions.
 			 */
-			if (wasShutdown)
+			if (wasShutdown || is_demoting)
 			{
 				RunningTransactionsData running;
 				TransactionId latestCompletedXid;
@@ -7941,6 +7993,7 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	XLogCtl->SharedHotStandbyActive = false;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -8059,6 +8112,23 @@ CheckRecoveryConsistency(void)
 	}
 }
 
+/*
+ * Refresh LocalRecoveryInProgress from shared memory and return it
+ */
+bool
+SetLocalRecoveryInProgress(void)
+{
+	/*
+	 * use volatile pointer to make sure we make a fresh read of the
+	 * shared variable.
+	 */
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
+
+	return LocalRecoveryInProgress;
+}
+
 /*
  * Is the system still in recovery?
  *
@@ -8080,13 +8150,7 @@ RecoveryInProgress(void)
 		return false;
 	else
 	{
-		/*
-		 * use volatile pointer to make sure we make a fresh read of the
-		 * shared variable.
-		 */
-		volatile XLogCtlData *xlogctl = XLogCtl;
-
-		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
+		SetLocalRecoveryInProgress();
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -8487,6 +8551,8 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
 void
 ShutdownXLOG(int code, Datum arg)
 {
+	bool is_demoting = DatumGetBool(arg);
+
 	/*
 	 * We should have an aux process resource owner to use, and we should not
 	 * be in a transaction that's installed some other resowner.
@@ -8496,35 +8562,55 @@ ShutdownXLOG(int code, Datum arg)
 		   CurrentResourceOwner == AuxProcessResourceOwner);
 	CurrentResourceOwner = AuxProcessResourceOwner;
 
-	/* Don't be chatty in standalone mode */
-	ereport(IsPostmasterEnvironment ? LOG : NOTICE,
-			(errmsg("shutting down")));
-
-	/*
-	 * Signal walsenders to move to stopping state.
-	 */
-	WalSndInitStopping();
-
-	/*
-	 * Wait for WAL senders to be in stopping state.  This prevents commands
-	 * from writing new WAL.
-	 */
-	WalSndWaitStopping();
+	if (is_demoting)
+	{
+		/* Don't be chatty in standalone mode */
+		ereport(IsPostmasterEnvironment ? LOG : NOTICE,
+				(errmsg("demoting")));
 
-	if (RecoveryInProgress())
-		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		/*
+		 * FIXME demote: avoiding checkpoint?
+		 * A checkpoint is probably running during a demote action. If
+		 * we don't want to wait for the checkpoint during the demote,
+		 * we might need to cancel it as it will not be able to write
+		 * to the WAL after the demote.
+		 */
+		CreateCheckPoint(CHECKPOINT_IS_DEMOTE | CHECKPOINT_IMMEDIATE);
+		ShutdownPreparedTransactions();
+		LocalRecoveryInProgress = true;
+	}
 	else
 	{
+		/* Don't be chatty in standalone mode */
+		ereport(IsPostmasterEnvironment ? LOG : NOTICE,
+				(errmsg("shutting down")));
+
 		/*
-		 * If archiving is enabled, rotate the last XLOG file so that all the
-		 * remaining records are archived (postmaster wakes up the archiver
-		 * process one more time at the end of shutdown). The checkpoint
-		 * record will go to the next XLOG file and won't be archived (yet).
+		 * Signal walsenders to move to stopping state.
 		 */
-		if (XLogArchivingActive() && XLogArchiveCommandSet())
-			RequestXLogSwitch(false);
+		WalSndInitStopping();
+
+		/*
+		 * Wait for WAL senders to be in stopping state.  This prevents commands
+		 * from writing new WAL.
+		 */
+		WalSndWaitStopping();
+
+		if (RecoveryInProgress())
+			CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		else
+		{
+			/*
+			 * If archiving is enabled, rotate the last XLOG file so that all the
+			 * remaining records are archived (postmaster wakes up the archiver
+			 * process one more time at the end of shutdown). The checkpoint
+			 * record will go to the next XLOG file and won't be archived (yet).
+			 */
+			if (XLogArchivingActive() && XLogArchiveCommandSet())
+				RequestXLogSwitch(false);
 
-		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+			CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		}
 	}
 	ShutdownCLOG();
 	ShutdownCommitTs();
@@ -8538,9 +8624,10 @@ ShutdownXLOG(int code, Datum arg)
 static void
 LogCheckpointStart(int flags, bool restartpoint)
 {
-	elog(LOG, "%s starting:%s%s%s%s%s%s%s%s",
+	elog(LOG, "%s starting:%s%s%s%s%s%s%s%s%s",
 		 restartpoint ? "restartpoint" : "checkpoint",
 		 (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
+		 (flags & CHECKPOINT_IS_DEMOTE) ? " demote" : "",
 		 (flags & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "",
 		 (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
 		 (flags & CHECKPOINT_FORCE) ? " force" : "",
@@ -8676,6 +8763,7 @@ UpdateCheckPointDistanceEstimate(uint64 nbytes)
  *
  * flags is a bitwise OR of the following:
  *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
+ *	CHECKPOINT_IS_DEMOTE: checkpoint is for demote.
  *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
  *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *		ignoring checkpoint_completion_target parameter.
@@ -8704,6 +8792,7 @@ void
 CreateCheckPoint(int flags)
 {
 	bool		shutdown;
+	bool		demote;
 	CheckPoint	checkPoint;
 	XLogRecPtr	recptr;
 	XLogSegNo	_logSegNo;
@@ -8716,14 +8805,21 @@ CreateCheckPoint(int flags)
 	int			nvxids;
 
 	/*
-	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
-	 * issued at a different time.
+	 * An end-of-recovery or demote checkpoint is really a shutdown checkpoint,
+	 * just issued at a different time.
 	 */
-	if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
+	if (flags & (CHECKPOINT_IS_SHUTDOWN |
+				 CHECKPOINT_IS_DEMOTE |
+				 CHECKPOINT_END_OF_RECOVERY))
 		shutdown = true;
 	else
 		shutdown = false;
 
+	if (flags & CHECKPOINT_IS_DEMOTE)
+		demote = true;
+	else
+		demote = false;
+
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
@@ -8764,7 +8860,7 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-		ControlFile->state = DB_SHUTDOWNING;
+		ControlFile->state = demote ? DB_DEMOTING : DB_SHUTDOWNING;
 		ControlFile->time = (pg_time_t) time(NULL);
 		UpdateControlFile();
 		LWLockRelease(ControlFileLock);
@@ -8810,7 +8906,7 @@ CreateCheckPoint(int flags)
 	 * avoid inserting duplicate checkpoints when the system is idle.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
-				  CHECKPOINT_FORCE)) == 0)
+				  CHECKPOINT_IS_DEMOTE | CHECKPOINT_FORCE)) == 0)
 	{
 		if (last_important_lsn == ControlFile->checkPoint)
 		{
@@ -8978,8 +9074,8 @@ CreateCheckPoint(int flags)
 	 * allows us to reconstruct the state of running transactions during
 	 * archive recovery, if required. Skip, if this info disabled.
 	 *
-	 * If we are shutting down, or Startup process is completing crash
-	 * recovery we don't need to write running xact data.
+	 * If we are shutting down, demoting or Startup process is completing
+	 * crash recovery we don't need to write running xact data.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
@@ -8998,11 +9094,11 @@ CreateCheckPoint(int flags)
 	XLogFlush(recptr);
 
 	/*
-	 * We mustn't write any new WAL after a shutdown checkpoint, or it will be
-	 * overwritten at next startup.  No-one should even try, this just allows
-	 * sanity-checking.  In the case of an end-of-recovery checkpoint, we want
-	 * to just temporarily disable writing until the system has exited
-	 * recovery.
+	 * We mustn't write any new WAL after a shutdown or demote checkpoint, or
+	 * it will be overwritten at next startup.  No-one should even try, this
+	 * just allows sanity-checking.  In the case of an end-of-recovery
+	 * checkpoint, we want to just temporarily disable writing until the system
+	 * has exited recovery.
 	 */
 	if (shutdown)
 	{
@@ -9018,7 +9114,8 @@ CreateCheckPoint(int flags)
 	 */
 	if (shutdown && checkPoint.redo != ProcLastRecPtr)
 		ereport(PANIC,
-				(errmsg("concurrent write-ahead log activity while database system is shutting down")));
+				(errmsg("concurrent write-ahead log activity while database system is %s",
+						demote ? "demoting" : "shutting down")));
 
 	/*
 	 * Remember the prior checkpoint's redo ptr for
@@ -9031,7 +9128,7 @@ CreateCheckPoint(int flags)
 	 */
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	if (shutdown)
-		ControlFile->state = DB_SHUTDOWNED;
+		ControlFile->state = demote ? DB_DEMOTING : DB_SHUTDOWNED;
 	ControlFile->checkPoint = ProcLastRecPtr;
 	ControlFile->checkPointCopy = checkPoint;
 	ControlFile->time = (pg_time_t) time(NULL);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 624a3238b8..58473a61fd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -151,6 +151,7 @@ double		CheckPointCompletionTarget = 0.5;
  * Private state
  */
 static bool ckpt_active = false;
+static volatile sig_atomic_t demoteRequestPending = false;
 
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
@@ -552,6 +553,21 @@ HandleCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
+	if (demoteRequestPending)
+	{
+		demoteRequestPending = false;
+		/* Close down the database */
+		ShutdownXLOG(0, BoolGetDatum(true));
+		/*
+		 * Exit checkpointer. We could keep it around during demotion, but
+		 * exiting here has multiple benefits:
+		 * - to create a fresh process with clean local vars
+		 *   (eg. LocalRecoveryInProgress)
+		 * - to signal postmaster the demote shutdown checkpoint is done
+		 *   and keep going with next steps of the demotion
+		 */
+		proc_exit(0);
+	}
 	if (ShutdownRequestPending)
 	{
 		/*
@@ -680,6 +696,7 @@ CheckpointWriteDelay(int flags, double progress)
 	 * in which case we just try to catch up as quickly as possible.
 	 */
 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!demoteRequestPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
 		IsCheckpointOnSchedule(progress))
@@ -812,6 +829,17 @@ IsCheckpointOnSchedule(double progress)
  * --------------------------------
  */
 
+/* SIGUSR1: set flag to demote */
+void
+ReqCheckpointDemoteHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	demoteRequestPending = true;
+
+	errno = save_errno;
+}
+
 /* SIGINT: set flag to run a normal checkpoint right away */
 static void
 ReqCheckpointHandler(SIGNAL_ARGS)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index dec02586c7..60f159fcb6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -150,6 +150,9 @@
 
 #define BACKEND_TYPE_WORKER		(BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
+/* file to signal demotion from primary to standby */
+#define DEMOTE_SIGNAL_FILE		"demote"
+
 /*
  * List of active backends (or child processes anyway; we don't actually
  * know whether a given child has become a backend or is still in the
@@ -269,18 +272,23 @@ typedef enum
 static StartupStatusEnum StartupStatus = STARTUP_NOT_RUNNING;
 
 /* Startup/shutdown state */
-#define			NoShutdown		0
-#define			SmartShutdown	1
-#define			FastShutdown	2
-#define			ImmediateShutdown	3
-
-static int	Shutdown = NoShutdown;
+typedef enum StepDownState {
+	NoShutdown = 0, /* find better label? */
+	SmartShutdown,
+	SmartDemote,
+	FastShutdown,
+	FastDemote,
+	ImmediateShutdown
+} StepDownState;
+
+static StepDownState StepDown = NoShutdown;
+static bool DemoteSignal = false; /* true on demote request */
 
 static bool FatalError = false; /* T if recovering from backend crash */
 
 /*
- * We use a simple state machine to control startup, shutdown, and
- * crash recovery (which is rather like shutdown followed by startup).
+ * We use a simple state machine to control startup, shutdown, demote and
+ * crash recovery (both are rather like shutdown followed by startup).
  *
  * After doing all the postmaster initialization work, we enter PM_STARTUP
  * state and the startup process is launched. The startup process begins by
@@ -314,7 +322,7 @@ static bool FatalError = false; /* T if recovering from backend crash */
  * will not be very long).
  *
  * Notice that this state variable does not distinguish *why* we entered
- * states later than PM_RUN --- Shutdown and FatalError must be consulted
+ * states later than PM_RUN --- StepDown and FatalError must be consulted
  * to find that out.  FatalError is never true in PM_RECOVERY_* or PM_RUN
  * states, nor in PM_SHUTDOWN states (because we don't enter those states
  * when trying to recover from a crash).  It can be true in PM_STARTUP state,
@@ -324,6 +332,7 @@ typedef enum
 {
 	PM_INIT,					/* postmaster starting */
 	PM_STARTUP,					/* waiting for startup subprocess */
+	PM_DEMOTING,				/* demote action in progress */
 	PM_RECOVERY,				/* in archive recovery mode */
 	PM_HOT_STANDBY,				/* in hot standby mode */
 	PM_RUN,						/* normal "database is alive" state */
@@ -414,6 +423,8 @@ static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
 static bool SignalSomeChildren(int signal, int targets);
 static void TerminateChildren(int signal);
+static bool CheckDemoteSignal(void);
+
 
 #define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
 
@@ -1550,7 +1561,7 @@ DetermineSleepTime(struct timeval *timeout)
 	 * Normal case: either there are no background workers at all, or we're in
 	 * a shutdown sequence (during which we ignore bgworkers altogether).
 	 */
-	if (Shutdown > NoShutdown ||
+	if (StepDown > NoShutdown ||
 		(!StartWorkerNeeded && !HaveCrashedWorker))
 	{
 		if (AbortStartTime != 0)
@@ -1830,7 +1841,7 @@ ServerLoop(void)
 		 *
 		 * Note we also do this during recovery from a process crash.
 		 */
-		if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
+		if ((StepDown >= ImmediateShutdown || (FatalError && !SendStop)) &&
 			AbortStartTime != 0 &&
 			(now - AbortStartTime) >= SIGKILL_CHILDREN_AFTER_SECS)
 		{
@@ -2305,6 +2316,11 @@ retry1:
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
 					 errmsg("the database system is starting up")));
 			break;
+		case CAC_DEMOTE:
+			ereport(FATAL,
+					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+					 errmsg("the database system is demoting")));
+			break;
 		case CAC_SHUTDOWN:
 			ereport(FATAL,
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
@@ -2436,7 +2452,7 @@ canAcceptConnections(int backend_type)
 	CAC_state	result = CAC_OK;
 
 	/*
-	 * Can't start backends when in startup/shutdown/inconsistent recovery
+	 * Can't start backends when in startup/demote/shutdown/inconsistent recovery
 	 * state.  We treat autovac workers the same as user backends for this
 	 * purpose.  However, bgworkers are excluded from this test; we expect
 	 * bgworker_should_start_now() decided whether the DB state allows them.
@@ -2452,7 +2468,9 @@ canAcceptConnections(int backend_type)
 	{
 		if (pmState == PM_WAIT_BACKUP)
 			result = CAC_WAITBACKUP;	/* allow superusers only */
-		else if (Shutdown > NoShutdown)
+		else if (StepDown == SmartDemote || StepDown == FastDemote)
+			return CAC_DEMOTE;	/* demote is pending */
+		else if (StepDown > NoShutdown)
 			return CAC_SHUTDOWN;	/* shutdown is pending */
 		else if (!FatalError &&
 				 (pmState == PM_STARTUP ||
@@ -2683,7 +2701,8 @@ SIGHUP_handler(SIGNAL_ARGS)
 	PG_SETMASK(&BlockSig);
 #endif
 
-	if (Shutdown <= SmartShutdown)
+	if (StepDown == NoShutdown || StepDown == SmartShutdown ||
+		StepDown == SmartDemote)
 	{
 		ereport(LOG,
 				(errmsg("received SIGHUP, reloading configuration files")));
@@ -2769,26 +2788,81 @@ pmdie(SIGNAL_ARGS)
 			(errmsg_internal("postmaster received signal %d",
 							 postgres_signal_arg)));
 
+	if (CheckDemoteSignal())
+	{
+		if (pmState != PM_RUN)
+		{
+			DemoteSignal = false;
+			unlink(DEMOTE_SIGNAL_FILE);
+			ereport(LOG,
+					(errmsg("ignoring demote signal because already in standby mode")));
+			goto out;
+		}
+		else if (postgres_signal_arg == SIGQUIT)
+		{
+			DemoteSignal = false;
+			unlink(DEMOTE_SIGNAL_FILE);
+			ereport(WARNING,
+					(errmsg("cannot demote in immediate stop mode")));
+			goto out;
+		}
+		else
+		{
+			FILE	   *standby_file;
+
+			DemoteSignal = true;
+
+			unlink(DEMOTE_SIGNAL_FILE);
+
+			/* create the standby signal file */
+			standby_file = AllocateFile(STANDBY_SIGNAL_FILE, "w");
+			if (!standby_file)
+			{
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create file \"%s\": %m",
+								STANDBY_SIGNAL_FILE)));
+				goto out;
+			}
+
+			if (FreeFile(standby_file))
+			{
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write file \"%s\": %m",
+								STANDBY_SIGNAL_FILE)));
+				goto out;
+			}
+		}
+	}
+
 	switch (postgres_signal_arg)
 	{
 		case SIGTERM:
 
 			/*
-			 * Smart Shutdown:
+			 * Smart Stepdown:
 			 *
-			 * Wait for children to end their work, then shut down.
+			 * Wait for children to end their work, then shut down or demote.
 			 */
-			if (Shutdown >= SmartShutdown)
+			if (StepDown >= SmartShutdown)
 				break;
-			Shutdown = SmartShutdown;
-			ereport(LOG,
-					(errmsg("received smart shutdown request")));
 
-			/* Report status */
-			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
+			if (DemoteSignal) {
+				StepDown = SmartDemote;
+				ereport(LOG, (errmsg("received smart demote request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_DEMOTING);
+			}
+			else {
+				StepDown = SmartShutdown;
+				ereport(LOG, (errmsg("received smart shutdown request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
 #ifdef USE_SYSTEMD
-			sd_notify(0, "STOPPING=1");
+				sd_notify(0, "STOPPING=1");
 #endif
+			}
 
 			if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 				pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
@@ -2831,22 +2905,29 @@ pmdie(SIGNAL_ARGS)
 		case SIGINT:
 
 			/*
-			 * Fast Shutdown:
+			 * Fast StepDown:
 			 *
 			 * Abort all children with SIGTERM (rollback active transactions
-			 * and exit) and shut down when they are gone.
+			 * and exit) and shut down or demote when they are gone.
 			 */
-			if (Shutdown >= FastShutdown)
+			if (StepDown >= FastShutdown)
 				break;
-			Shutdown = FastShutdown;
-			ereport(LOG,
-					(errmsg("received fast shutdown request")));
 
-			/* Report status */
-			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
+			if (DemoteSignal) {
+				StepDown = FastDemote;
+				ereport(LOG, (errmsg("received fast demote request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_DEMOTING);
+			}
+			else {
+				StepDown = FastShutdown;
+				ereport(LOG, (errmsg("received fast shutdown request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
 #ifdef USE_SYSTEMD
-			sd_notify(0, "STOPPING=1");
+				sd_notify(0, "STOPPING=1");
 #endif
+			}
 
 			if (StartupPID != 0)
 				signal_child(StartupPID, SIGTERM);
@@ -2903,9 +2984,9 @@ pmdie(SIGNAL_ARGS)
 			 * terminate remaining ones with SIGKILL, then exit without
 			 * attempt to properly shut down the data base system.
 			 */
-			if (Shutdown >= ImmediateShutdown)
+			if (StepDown >= ImmediateShutdown)
 				break;
-			Shutdown = ImmediateShutdown;
+			StepDown = ImmediateShutdown;
 			ereport(LOG,
 					(errmsg("received immediate shutdown request")));
 
@@ -2929,6 +3010,7 @@ pmdie(SIGNAL_ARGS)
 			break;
 	}
 
+out:
 #ifdef WIN32
 	PG_SETMASK(&UnBlockSig);
 #endif
@@ -2967,10 +3049,11 @@ reaper(SIGNAL_ARGS)
 			StartupPID = 0;
 
 			/*
-			 * Startup process exited in response to a shutdown request (or it
-			 * completed normally regardless of the shutdown request).
+			 * Startup process exited in response to a shutdown or demote
+			 * request (or it completed normally regardless of the shutdown
+			 * request).
 			 */
-			if (Shutdown > NoShutdown &&
+			if (StepDown > NoShutdown &&
 				(EXIT_STATUS_0(exitstatus) || EXIT_STATUS_1(exitstatus)))
 			{
 				StartupStatus = STARTUP_NOT_RUNNING;
@@ -2984,7 +3067,7 @@ reaper(SIGNAL_ARGS)
 				ereport(LOG,
 						(errmsg("shutdown at recovery target")));
 				StartupStatus = STARTUP_NOT_RUNNING;
-				Shutdown = SmartShutdown;
+				StepDown = SmartShutdown;
 				TerminateChildren(SIGTERM);
 				pmState = PM_WAIT_BACKENDS;
 				/* PostmasterStateMachine logic does the rest */
@@ -3124,7 +3207,7 @@ reaper(SIGNAL_ARGS)
 				 * archive cycle and quit. Likewise, if we have walsender
 				 * processes, tell them to send any remaining WAL and quit.
 				 */
-				Assert(Shutdown > NoShutdown);
+				Assert(StepDown > NoShutdown);
 
 				/* Waken archiver for the last time */
 				if (PgArchPID != 0)
@@ -3145,6 +3228,18 @@ reaper(SIGNAL_ARGS)
 				if (PgStatPID != 0)
 					signal_child(PgStatPID, SIGQUIT);
 			}
+			else if (EXIT_STATUS_0(exitstatus) &&
+					 DemoteSignal &&
+					 pmState == PM_DEMOTING)
+			{
+				/*
+				 * The checkpointer exiting signals that the demote shutdown
+				 * checkpoint is done. Recovery can be started from there.
+				 */
+				ereport(DEBUG1,
+						(errmsg_internal("checkpointer shutdown for demote")));
+				StepDown = NoShutdown;
+			}
 			else
 			{
 				/*
@@ -3484,7 +3579,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	 * signaled children, nonzero exit status is to be expected, so don't
 	 * clutter log.
 	 */
-	take_action = !FatalError && Shutdown != ImmediateShutdown;
+	take_action = !FatalError && StepDown != ImmediateShutdown;
 
 	if (take_action)
 	{
@@ -3702,7 +3797,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 
 	/* We do NOT restart the syslogger */
 
-	if (Shutdown != ImmediateShutdown)
+	if (StepDown != ImmediateShutdown)
 		FatalError = true;
 
 	/* We now transit into a state of waiting for children to die */
@@ -3845,11 +3940,11 @@ PostmasterStateMachine(void)
 			WalReceiverPID == 0 &&
 			BgWriterPID == 0 &&
 			(CheckpointerPID == 0 ||
-			 (!FatalError && Shutdown < ImmediateShutdown)) &&
+			 (!FatalError && StepDown < ImmediateShutdown)) &&
 			WalWriterPID == 0 &&
 			AutoVacPID == 0)
 		{
-			if (Shutdown >= ImmediateShutdown || FatalError)
+			if (StepDown >= ImmediateShutdown || FatalError)
 			{
 				/*
 				 * Start waiting for dead_end children to die.  This state
@@ -3863,6 +3958,14 @@ PostmasterStateMachine(void)
 				 * FatalError state.
 				 */
 			}
+			/* Handle demote signal */
+			else if (DemoteSignal)
+			{
+				ereport(LOG, (errmsg("all backend processes terminated; demoting")));
+
+				SendProcSignal(CheckpointerPID, PROCSIG_CHECKPOINTER_DEMOTING, InvalidBackendId);
+				pmState = PM_DEMOTING;
+			}
 			else
 			{
 				/*
@@ -3870,7 +3973,7 @@ PostmasterStateMachine(void)
 				 * the regular children are gone, and it's time to tell the
 				 * checkpointer to do a shutdown checkpoint.
 				 */
-				Assert(Shutdown > NoShutdown);
+				Assert(StepDown > NoShutdown);
 				/* Start the checkpointer if not running */
 				if (CheckpointerPID == 0)
 					CheckpointerPID = StartCheckpointer();
@@ -3958,7 +4061,8 @@ PostmasterStateMachine(void)
 	 * EOF on its input pipe, which happens when there are no more upstream
 	 * processes.
 	 */
-	if (Shutdown > NoShutdown && pmState == PM_NO_CHILDREN)
+	if (pmState == PM_NO_CHILDREN && (StepDown == SmartShutdown ||
+		StepDown == FastShutdown || StepDown == ImmediateShutdown))
 	{
 		if (FatalError)
 		{
@@ -3991,10 +4095,23 @@ PostmasterStateMachine(void)
 	 * startup process fails, because more than likely it will just fail again
 	 * and we will keep trying forever.
 	 */
-	if (pmState == PM_NO_CHILDREN &&
+	if (pmState == PM_NO_CHILDREN && !DemoteSignal &&
 		(StartupStatus == STARTUP_CRASHED || !restart_after_crash))
 		ExitPostmaster(1);
 
+
+	/* Demoting: start the Startup Process */
+	if (pmState == PM_DEMOTING && StepDown == NoShutdown)
+	{
+		if (!XLogArchivingAlways() && PgArchPID != 0)
+			signal_child(PgArchPID, SIGQUIT);
+
+		StartupPID = StartupDataBase();
+		Assert(StartupPID != 0);
+		pmState = PM_STARTUP;
+		StartupStatus = STARTUP_RUNNING;
+	}
+
 	/*
 	 * If we need to recover from a crash, wait for all non-syslogger children
 	 * to exit, then reset shmem and StartupDataBase.
@@ -5195,7 +5312,7 @@ sigusr1_handler(SIGNAL_ARGS)
 	 * first. We don't want to go back to recovery in that case.
 	 */
 	if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED) &&
-		pmState == PM_STARTUP && Shutdown == NoShutdown)
+		pmState == PM_STARTUP && StepDown == NoShutdown)
 	{
 		/* WAL redo has started. We're out of reinitialization. */
 		FatalError = false;
@@ -5205,19 +5322,29 @@ sigusr1_handler(SIGNAL_ARGS)
 		 * Crank up the background tasks.  It doesn't matter if this fails,
 		 * we'll just try again later.
 		 */
+		if (!DemoteSignal)
+		{
+			Assert(BgWriterPID == 0);
+			Assert(PgArchPID == 0);
+		}
+
 		Assert(CheckpointerPID == 0);
 		CheckpointerPID = StartCheckpointer();
-		Assert(BgWriterPID == 0);
-		BgWriterPID = StartBackgroundWriter();
+
+		if (BgWriterPID == 0)
+			BgWriterPID = StartBackgroundWriter();
 
 		/*
 		 * Start the archiver if we're responsible for (re-)archiving received
 		 * files.
 		 */
-		Assert(PgArchPID == 0);
-		if (XLogArchivingAlways())
+		if (PgArchPID == 0 && XLogArchivingAlways())
 			PgArchPID = pgarch_start();
 
+		if (DemoteSignal) {
+			SignalSomeChildren(SIGHUP, BACKEND_TYPE_WALSND);
+		}
+
 		/*
 		 * If we aren't planning to enter hot standby mode later, treat
 		 * RECOVERY_STARTED as meaning we're out of startup, and report status
@@ -5226,6 +5353,7 @@ sigusr1_handler(SIGNAL_ARGS)
 		if (!EnableHotStandby)
 		{
 			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STANDBY);
+			DemoteSignal = false;
 #ifdef USE_SYSTEMD
 			sd_notify(0, "READY=1");
 #endif
@@ -5234,13 +5362,15 @@ sigusr1_handler(SIGNAL_ARGS)
 		pmState = PM_RECOVERY;
 	}
 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
-		pmState == PM_RECOVERY && Shutdown == NoShutdown)
+		pmState == PM_RECOVERY && StepDown == NoShutdown)
 	{
 		/*
 		 * Likewise, start other special children as needed.
 		 */
-		Assert(PgStatPID == 0);
-		PgStatPID = pgstat_start();
+		if (!DemoteSignal)
+			Assert(PgStatPID == 0);
+		if(PgStatPID == 0)
+			PgStatPID = pgstat_start();
 
 		ereport(LOG,
 				(errmsg("database system is ready to accept read only connections")));
@@ -5252,6 +5382,7 @@ sigusr1_handler(SIGNAL_ARGS)
 #endif
 
 		pmState = PM_HOT_STANDBY;
+		DemoteSignal = false;
 		/* Some workers may be scheduled to start now */
 		StartWorkerNeeded = true;
 	}
@@ -5284,7 +5415,7 @@ sigusr1_handler(SIGNAL_ARGS)
 	}
 
 	if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER) &&
-		Shutdown == NoShutdown)
+		StepDown == NoShutdown)
 	{
 		/*
 		 * Start one iteration of the autovacuum daemon, even if autovacuuming
@@ -5299,7 +5430,7 @@ sigusr1_handler(SIGNAL_ARGS)
 	}
 
 	if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER) &&
-		Shutdown == NoShutdown)
+		StepDown == NoShutdown)
 	{
 		/* The autovacuum launcher wants us to start a worker process. */
 		StartAutovacuumWorker();
@@ -5644,7 +5775,7 @@ MaybeStartWalReceiver(void)
 	if (WalReceiverPID == 0 &&
 		(pmState == PM_STARTUP || pmState == PM_RECOVERY ||
 		 pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&
-		Shutdown == NoShutdown)
+		StepDown == NoShutdown)
 	{
 		WalReceiverPID = StartWalReceiver();
 		if (WalReceiverPID != 0)
@@ -5899,6 +6030,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 		case PM_WAIT_BACKENDS:
 		case PM_WAIT_READONLY:
 		case PM_WAIT_BACKUP:
+		case PM_DEMOTING:
 			break;
 
 		case PM_RUN:
@@ -6647,3 +6779,18 @@ InitPostmasterDeathWatchHandle(void)
 								 GetLastError())));
 #endif							/* WIN32 */
 }
+
+/*
+ * Check if a demote request appeared. Should be called by postmaster before
+ * shutting down.
+ */
+bool
+CheckDemoteSignal(void)
+{
+	struct stat stat_buf;
+
+	if (stat(DEMOTE_SIGNAL_FILE, &stat_buf) == 0)
+		return true;
+
+	return false;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5e2210dd7b..9a2bff7e5e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2267,6 +2267,7 @@ WalSndLoop(WalSndSendDataCallback send_data)
 			ConfigReloadPending = false;
 			ProcessConfigFile(PGC_SIGHUP);
 			SyncRepInitConfig();
+			am_cascading_walsender = SetLocalRecoveryInProgress();
 		}
 
 		/* Check for input from the client */
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index b448533564..0ccc32f4ce 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -191,6 +191,8 @@ ProcArrayShmemSize(void)
 	size = add_size(size, mul_size(sizeof(int), PROCARRAY_MAXPROCS));
 
 	/*
+	 * TODO demote: check safe hotStandby related init and snapshot mech.
+	 *
 	 * During Hot Standby processing we have a data structure called
 	 * KnownAssignedXids, created in shared memory. Local data structures are
 	 * also created in various backends during GetSnapshotData(),
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4fa385b0ec..1903f4db2a 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -28,6 +28,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "postmaster/bgwriter.h"
 
 /*
  * The SIGUSR1 signal is multiplexed to support signaling multiple event
@@ -585,6 +586,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	if (CheckProcSignal(PROCSIG_CHECKPOINTER_DEMOTING))
+		ReqCheckpointDemoteHandler(PROCSIG_CHECKPOINTER_DEMOTING);
+
 	SetLatch(MyLatch);
 
 	latch_sigusr1_handler();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 95989ce79b..52f85cd1b3 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4371,6 +4371,18 @@ lock_twophase_postabort(TransactionId xid, uint16 info,
 	lock_twophase_postcommit(xid, info, recdata, len);
 }
 
+/*
+ * 2PC shutdown from lock table.
+ *
+ * This is actually just the same as the COMMIT case.
+ */
+void
+lock_twophase_shutdown(TransactionId xid, uint16 info,
+						void *recdata, uint32 len)
+{
+	lock_twophase_postcommit(xid, info, recdata, len);
+}
+
 /*
  *		VirtualXactLockTableInsert
  *
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df74..c144cc35d3 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -57,6 +57,8 @@ dbState(DBState state)
 			return _("shut down");
 		case DB_SHUTDOWNED_IN_RECOVERY:
 			return _("shut down in recovery");
+		case DB_DEMOTING:
+			return _("demoting");
 		case DB_SHUTDOWNING:
 			return _("shutting down");
 		case DB_IN_CRASH_RECOVERY:
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 3c03ace7ed..79bb42f7e7 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -62,6 +62,7 @@ typedef enum
 	RESTART_COMMAND,
 	RELOAD_COMMAND,
 	STATUS_COMMAND,
+	DEMOTE_COMMAND,
 	PROMOTE_COMMAND,
 	LOGROTATE_COMMAND,
 	KILL_COMMAND,
@@ -103,6 +104,7 @@ static char version_file[MAXPGPATH];
 static char pid_file[MAXPGPATH];
 static char backup_file[MAXPGPATH];
 static char promote_file[MAXPGPATH];
+static char demote_file[MAXPGPATH];
 static char logrotate_file[MAXPGPATH];
 
 static volatile pgpid_t postmasterPID = -1;
@@ -129,6 +131,7 @@ static void do_stop(void);
 static void do_restart(void);
 static void do_reload(void);
 static void do_status(void);
+static void do_demote(void);
 static void do_promote(void);
 static void do_logrotate(void);
 static void do_kill(pgpid_t pid);
@@ -1029,6 +1032,109 @@ do_stop(void)
 }
 
 
+static void
+do_demote(void)
+{
+	int			cnt;
+	FILE	   *dmtfile;
+	pgpid_t		pid;
+	struct stat statbuf;
+
+	pid = get_pgpid(false);
+
+	if (pid == 0)				/* no pid file */
+	{
+		write_stderr(_("%s: PID file \"%s\" does not exist\n"), progname, pid_file);
+		write_stderr(_("Is server running?\n"));
+		exit(1);
+	}
+	else if (pid < 0)			/* standalone backend, not postmaster */
+	{
+		pid = -pid;
+		write_stderr(_("%s: cannot demote server; "
+					   "single-user server is running (PID: %ld)\n"),
+					 progname, pid);
+		exit(1);
+	}
+	if (shutdown_mode == IMMEDIATE_MODE)
+	{
+		write_stderr(_("%s: cannot demote server using immediate mode\n"),
+					 progname);
+		exit(1);
+	}
+
+	snprintf(demote_file, MAXPGPATH, "%s/demote", pg_data);
+
+	if ((dmtfile = fopen(demote_file, "w")) == NULL)
+	{
+		write_stderr(_("%s: could not create demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+	if (fclose(dmtfile))
+	{
+		write_stderr(_("%s: could not write demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+
+	if (kill((pid_t) pid, sig) != 0)
+	{
+		write_stderr(_("%s: could not send stop signal (PID: %ld): %s\n"), progname, pid,
+					 strerror(errno));
+		exit(1);
+	}
+
+	if (!do_wait)
+	{
+		print_msg(_("server demoting\n"));
+		return;
+	}
+	else
+	{
+		/*
+		 * If backup_label exists, an online backup is running. Warn the user
+		 * that smart demote will wait for it to finish. However, if the
+		 * server is in archive recovery, we're recovering from an online
+		 * backup instead of performing one.
+		 */
+		if (shutdown_mode == SMART_MODE &&
+			stat(backup_file, &statbuf) == 0 &&
+			get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		{
+			print_msg(_("WARNING: online backup mode is active\n"
+						"Demote will not complete until pg_stop_backup() is called.\n\n"));
+		}
+
+		print_msg(_("waiting for server to demote..."));
+
+		for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
+		{
+			if (get_control_dbstate() == DB_IN_ARCHIVE_RECOVERY)
+				break;
+
+			if (cnt % WAITS_PER_SEC == 0)
+				print_msg(".");
+			pg_usleep(USEC_PER_SEC / WAITS_PER_SEC);
+		}
+
+		if (get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		{
+			print_msg(_(" failed\n"));
+
+			write_stderr(_("%s: server does not demote\n"), progname);
+			if (shutdown_mode == SMART_MODE)
+				write_stderr(_("HINT: The \"-m fast\" option immediately disconnects sessions rather than\n"
+							   "waiting for session-initiated disconnection.\n"));
+			exit(1);
+		}
+		print_msg(_(" done\n"));
+
+		print_msg(_("server demoted\n"));
+	}
+}
+
+
 /*
  *	restart/reload routines
  */
@@ -2452,6 +2558,8 @@ main(int argc, char **argv)
 				ctl_command = RELOAD_COMMAND;
 			else if (strcmp(argv[optind], "status") == 0)
 				ctl_command = STATUS_COMMAND;
+			else if (strcmp(argv[optind], "demote") == 0)
+				ctl_command = DEMOTE_COMMAND;
 			else if (strcmp(argv[optind], "promote") == 0)
 				ctl_command = PROMOTE_COMMAND;
 			else if (strcmp(argv[optind], "logrotate") == 0)
@@ -2559,6 +2667,9 @@ main(int argc, char **argv)
 		case RELOAD_COMMAND:
 			do_reload();
 			break;
+		case DEMOTE_COMMAND:
+			do_demote();
+			break;
 		case PROMOTE_COMMAND:
 			do_promote();
 			break;
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3445..4b56f92181 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -53,6 +53,7 @@ extern void RecoverPreparedTransactions(void);
 extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
 
 extern void FinishPreparedTransaction(const char *gid, bool isCommit);
+void ShutdownPreparedTransactions(void);
 
 extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 5b14334887..be5e96e437 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -219,18 +219,20 @@ extern bool XLOG_DEBUG;
 
 /* These directly affect the behavior of CreateCheckPoint and subsidiaries */
 #define CHECKPOINT_IS_SHUTDOWN	0x0001	/* Checkpoint is for shutdown */
-#define CHECKPOINT_END_OF_RECOVERY	0x0002	/* Like shutdown checkpoint, but
+#define CHECKPOINT_IS_DEMOTE	0x0002	/* Like shutdown checkpoint, but
+											 * issued at end of WAL production */
+#define CHECKPOINT_END_OF_RECOVERY	0x0004	/* Like shutdown checkpoint, but
 											 * issued at end of WAL recovery */
-#define CHECKPOINT_IMMEDIATE	0x0004	/* Do it without delays */
-#define CHECKPOINT_FORCE		0x0008	/* Force even if no activity */
-#define CHECKPOINT_FLUSH_ALL	0x0010	/* Flush all pages, including those
+#define CHECKPOINT_IMMEDIATE	0x0008	/* Do it without delays */
+#define CHECKPOINT_FORCE		0x0010	/* Force even if no activity */
+#define CHECKPOINT_FLUSH_ALL	0x0020	/* Flush all pages, including those
 										 * belonging to unlogged tables */
 /* These are important to RequestCheckpoint */
-#define CHECKPOINT_WAIT			0x0020	/* Wait for completion */
-#define CHECKPOINT_REQUESTED	0x0040	/* Checkpoint request has been made */
+#define CHECKPOINT_WAIT			0x0040	/* Wait for completion */
+#define CHECKPOINT_REQUESTED	0x0080	/* Checkpoint request has been made */
 /* These indicate the cause of a checkpoint request */
-#define CHECKPOINT_CAUSE_XLOG	0x0080	/* XLOG consumption */
-#define CHECKPOINT_CAUSE_TIME	0x0100	/* Elapsed time */
+#define CHECKPOINT_CAUSE_XLOG	0x0100	/* XLOG consumption */
+#define CHECKPOINT_CAUSE_TIME	0x0200	/* Elapsed time */
 
 /*
  * Flag bits for the record being inserted, set using XLogSetRecordFlags().
@@ -300,6 +302,7 @@ extern const char *xlog_identify(uint8 info);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
+extern bool SetLocalRecoveryInProgress(void);
 extern bool RecoveryInProgress(void);
 extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..f529f8c7bd 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -87,6 +87,7 @@ typedef enum DBState
 	DB_STARTUP = 0,
 	DB_SHUTDOWNED,
 	DB_SHUTDOWNED_IN_RECOVERY,
+	DB_DEMOTING,
 	DB_SHUTDOWNING,
 	DB_IN_CRASH_RECOVERY,
 	DB_IN_ARCHIVE_RECOVERY,
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 179ebaa104..a9e27f009e 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -70,7 +70,12 @@ typedef struct
 
 typedef enum CAC_state
 {
-	CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
+	CAC_OK,
+	CAC_STARTUP,
+	CAC_DEMOTE,
+	CAC_SHUTDOWN,
+	CAC_RECOVERY,
+	CAC_TOOMANY,
 	CAC_WAITBACKUP
 } CAC_state;
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e..4d4f0ea1dd 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -41,5 +41,6 @@ extern Size CheckpointerShmemSize(void);
 extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
+extern void ReqCheckpointDemoteHandler(SIGNAL_ARGS);
 
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index fdabf42721..d3b08163a2 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void lock_twophase_postcommit(TransactionId xid, uint16 info,
 									 void *recdata, uint32 len);
 extern void lock_twophase_postabort(TransactionId xid, uint16 info,
 									void *recdata, uint32 len);
+extern void lock_twophase_shutdown(TransactionId xid, uint16 info,
+									void *recdata, uint32 len);
 extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
 										  void *recdata, uint32 len);
 
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f3..eb0bda04f5 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -34,6 +34,7 @@ typedef enum
 	PROCSIG_PARALLEL_MESSAGE,	/* message from cooperating parallel backend */
 	PROCSIG_WALSND_INIT_STOPPING,	/* ask walsenders to prepare for shutdown  */
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
+	PROCSIG_CHECKPOINTER_DEMOTING,	/* ask checkpointer to demote */
 
 	/* Recovery conflict reasons */
 	PROCSIG_RECOVERY_CONFLICT_DATABASE,
diff --git a/src/include/utils/pidfile.h b/src/include/utils/pidfile.h
index 63fefe5c4c..f761d2c4ef 100644
--- a/src/include/utils/pidfile.h
+++ b/src/include/utils/pidfile.h
@@ -50,6 +50,7 @@
  */
 #define PM_STATUS_STARTING		"starting"	/* still starting up */
 #define PM_STATUS_STOPPING		"stopping"	/* in shutdown sequence */
+#define PM_STATUS_DEMOTING		"demoting"	/* demote sequence */
 #define PM_STATUS_READY			"ready   "	/* ready for connections */
 #define PM_STATUS_STANDBY		"standby "	/* up, won't accept connections */
 
-- 
2.20.1

v3-0002-Add-various-tests-related-to-demote-and-promote-acti.patchtext/x-patchDownload
From b548e865e5d0532a03416cbc8db923c1a2f2f01e Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Fri, 10 Jul 2020 02:00:38 +0200
Subject: [PATCH 2/2] Add various tests related to demote and promote actions

* demote/promote with a standby replicating from the node
* make sure 2PC survive a demote/promote cycle
* commit 2PC and check the result
* swap roles between primary and standby
* commit a 2PC on the new primary
---
 src/test/perl/PostgresNode.pm             |  25 +++++
 src/test/recovery/t/021_promote-demote.pl | 129 ++++++++++++++++++++++
 2 files changed, 154 insertions(+)
 create mode 100644 src/test/recovery/t/021_promote-demote.pl

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 8c1b77376f..4488365ffc 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -906,6 +906,31 @@ sub promote
 
 =pod
 
+=item $node->demote()
+
+Wrapper for pg_ctl demote
+
+=cut
+
+sub demote
+{
+	my ($self, $mode) = @_;
+	my $port    = $self->port;
+	my $pgdata  = $self->data_dir;
+	my $logfile = $self->logfile;
+	my $name    = $self->name;
+
+	$mode = 'fast' unless defined $mode;
+
+	print "### Demoting node \"$name\" using mode $mode\n";
+
+	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-l', $logfile,
+		'-m', $mode, 'demote');
+	return;
+}
+
+=pod
+
 =item $node->logrotate()
 
 Wrapper for pg_ctl logrotate
diff --git a/src/test/recovery/t/021_promote-demote.pl b/src/test/recovery/t/021_promote-demote.pl
new file mode 100644
index 0000000000..04e2207470
--- /dev/null
+++ b/src/test/recovery/t/021_promote-demote.pl
@@ -0,0 +1,129 @@
+# Test demote/promote actions in various scenarios using two
+# nodes, alpha and beta. We check the results of each action,
+# correct data replication across multiple demote/promote cycles,
+# and manual switchover.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 13;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize node alpha
+my $node_alpha = get_new_node('alpha');
+$node_alpha->init(allows_streaming => 1);
+$node_alpha->append_conf(
+	'postgresql.conf', qq(
+	max_prepared_transactions = 10
+	log_checkpoints = true
+	log_replication_commands = true
+
+));
+
+# Take backup
+my $backup_name = 'alpha_backup';
+$node_alpha->start;
+$node_alpha->backup($backup_name);
+
+# Create node beta from backup
+my $node_beta = get_new_node('beta');
+$node_beta->init_from_backup($node_alpha, $backup_name);
+$node_beta->enable_streaming($node_alpha);
+$node_beta->start;
+
+
+# Create some 2PC on alpha for future tests
+$node_alpha->safe_psql('postgres', q{
+CREATE TABLE ins AS SELECT 1 AS i;
+BEGIN;
+CREATE TABLE new AS SELECT generate_series(1,5) AS i;
+PREPARE TRANSACTION 'pxact1';
+BEGIN;
+INSERT INTO ins VALUES (2);
+PREPARE TRANSACTION 'pxact2';
+});
+
+# Demote alpha. beta should keep streaming from it as a
+# cascaded standby.
+$node_alpha->demote;
+
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	't', 'node alpha demoted to standby' );
+
+is( $node_alpha->safe_psql(
+		'postgres', 'SELECT application_name FROM pg_stat_replication'),
+	$node_beta->name, 'beta is still replicating with alpha after demote' );
+
+# Promote alpha back in production.
+$node_alpha->promote;
+
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	'f', "node alpha promoted" );
+
+# Check all 2PC xact have been restored
+is( $node_alpha->safe_psql(
+		'postgres',
+		"SELECT string_agg(gid, ',' order by gid asc) FROM pg_prepared_xacts"),
+	'pxact1,pxact2', "prepared transactions 'pxact1' and 'pxact2' exists" );
+
+# Commit one 2PC and check it on alpha and beta
+$node_alpha->safe_psql( 'postgres', "commit prepared 'pxact1'");
+
+is( $node_alpha->safe_psql(
+		'postgres', "SELECT string_agg(i::text, ',' order by i asc) FROM new"),
+	'1,2,3,4,5', "prepared transaction 'pxact1' commited" );
+
+$node_alpha->wait_for_catchup($node_beta);
+
+is( $node_beta->safe_psql(
+		'postgres', "SELECT string_agg(i::text, ',' order by i asc) FROM new"),
+	'1,2,3,4,5', "prepared transaction 'pxact1' replicated to beta" );
+
+# swap roles between alpha and beta
+# demote and check alpha
+$node_alpha->demote;
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	't', "node alpha demoted" );
+
+# fetch the last REDO location from alpha and check beta received everything
+my ($stdout, $stderr) = run_command([ 'pg_controldata', $node_alpha->data_dir ]);
+$stdout =~ m{REDO location:\s+([0-9A-F]+/[0-9A-F]+)$}mg;
+my $redo_loc = $1;
+
+is( $node_beta->safe_psql(
+		'postgres',
+		"SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), '$redo_loc') > 0 "),
+	't', "node beta received the demote checkpoint from alpha" );
+
+# promote beta and check it
+$node_beta->promote;
+is( $node_beta->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	'f', "node beta promoted" );
+
+# Setup alpha to replicate from beta
+$node_alpha->enable_streaming($node_beta);
+$node_alpha->reload;
+
+# check alpha is replicating from it
+$node_beta->wait_for_catchup($node_alpha);
+
+is( $node_beta->safe_psql(
+		'postgres', 'SELECT application_name FROM pg_stat_replication'),
+	$node_alpha->name, 'alpha is replicating from beta' );
+
+# make sure the second 2PC is still available on beta
+is( $node_beta->safe_psql(
+		'postgres', 'SELECT gid FROM pg_prepared_xacts'),
+	'pxact2', "prepared transaction 'pxact2' exists" );
+
+# commit the second 2PC and check its result on both nodes
+$node_beta->safe_psql( 'postgres', "commit prepared 'pxact2'");
+
+is( $node_beta->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', "prepared transaction 'pxact2' commited" );
+
+$node_beta->wait_for_catchup($node_alpha);
+is( $node_alpha->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', "prepared transaction 'pxact2' streamed to alpha" );
-- 
2.20.1

#25Amul Sul
sulamul@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#24)
Re: [patch] demote

On Mon, Jul 13, 2020 at 8:35 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

Hi,

Another summary + patch + tests.

This patch supports 2PC. The goal is to keep them safe during demote/promote
actions so they can be committed/rollbacked later on a primary. See tests.

I wonder whether it is necessary to clear prepared transactions from shared
memory. Can't we simply skip clearing and restoring prepared transactions while
demoting?

Regards,
Amul

#26Andres Freund
andres@anarazel.de
In reply to: Amul Sul (#25)
Re: [patch] demote

Hi,

On 2020-07-14 17:26:37 +0530, Amul Sul wrote:

On Mon, Jul 13, 2020 at 8:35 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

Hi,

Another summary + patch + tests.

This patch supports 2PC. The goal is to keep them safe during demote/promote
actions so they can be committed/rollbacked later on a primary. See tests.

Wondering is it necessary to clear prepared transactions from shared memory?
Can't simply skip clearing and restoring prepared transactions while demoting?

Recovery doesn't use the normal PGXACT/PGPROC mechanisms to store
transaction state. So I don't think it'd be correct to leave them around
in their previous state. We'd likely end up with incorrect snapshots
if a demoted node later gets promoted...

Greetings,

Andres Freund

In reply to: Andres Freund (#26)
Re: [patch] demote

On Tue, 14 Jul 2020 12:49:51 -0700
Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-07-14 17:26:37 +0530, Amul Sul wrote:

On Mon, Jul 13, 2020 at 8:35 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

Hi,

Another summary + patch + tests.

This patch supports 2PC. The goal is to keep them safe during
demote/promote actions so they can be committed/rollbacked later on a
primary. See tests.

Wondering is it necessary to clear prepared transactions from shared memory?
Can't simply skip clearing and restoring prepared transactions while
demoting?

Recovery doesn't use the normal PGXACT/PGPROC mechanisms to store
transaction state. So I don't think it'd be correct to leave them around
in their previous state. We'd likely end up with incorrect snapshots
if a demoted node later gets promoted...

Indeed. I experienced it while debugging. PGXACT/PGPROC/locks need to
be cleared.

In reply to: Jehan-Guillaume de Rorthais (#24)
4 attachment(s)
Re: [patch] demote

Hi,

Yet another summary + patch + tests.

Demote now keeps backends with no active xid alive. Smart mode keeps all
backends: it waits for them to finish their xact and enter read-only. Fast
mode terminates backends with an active xid and keeps all other ones.
Backends enter "read-only" by setting LocalXLogInsertAllowed=0, then flip it
to -1 (check recovery state) once demoted.
During demote, no new session is allowed.
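
To illustrate the read-only switch described above, here is a minimal
standalone sketch of the tri-state logic. It is not code from the patch: the
lower-case names and the mock recovery flag are made up for the example, and
only the meaning of the values 1, 0 and -1 follows LocalXLogInsertAllowed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared "recovery in progress" state. */
bool mock_recovery_in_progress = false;

/*
 * Per-process tri-state, mirroring the meaning of LocalXLogInsertAllowed:
 *   1 = WAL insertion unconditionally allowed
 *   0 = WAL insertion unconditionally forbidden
 *  -1 = undecided, consult the recovery state
 */
int local_xlog_insert_allowed = -1;

/* Illustrative setter: backend enters read-only during demote. */
void
set_xlog_insert_not_allowed(void)
{
	local_xlog_insert_allowed = 0;
}

/* Illustrative setter: return to the "check" state once demoted. */
void
set_xlog_insert_check_recovery(void)
{
	local_xlog_insert_allowed = -1;
}

/* Can this process write WAL right now? */
bool
xlog_insert_allowed(void)
{
	assert(local_xlog_insert_allowed >= -1 && local_xlog_insert_allowed <= 1);

	/* A decided state (0 or 1) wins without looking at recovery. */
	if (local_xlog_insert_allowed >= 0)
		return local_xlog_insert_allowed != 0;

	/* Undecided: writes are forbidden as long as recovery is running. */
	if (mock_recovery_in_progress)
		return false;

	/* Recovery is over: cache the answer for later calls. */
	local_xlog_insert_allowed = 1;
	return true;
}
```

In this model a surviving backend first calls set_xlog_insert_not_allowed()
while the demote is in progress, then set_xlog_insert_check_recovery() once
the instance is a standby, so that a later promote re-enables writes without
any further signal.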

As backends with no active xid survive, a new SQL admin function
"pg_demote(fast bool, wait bool, wait_seconds int)" has been added.

Demote now relies on sigusr1 instead of hijacking sigterm/sigint and pmdie().
The resulting refactoring makes the code much simpler, cleaner, with better
isolation of actions from the code point of view.

Thanks to the refactoring, the patch now only adds one state to the state
machine: PM_DEMOTING. A second one could be used to replace:

/* Demoting: start the Startup Process */
if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0)

with eg.:

if (pmState == PM_DEMOTED)

I believe it might be a bit simpler to understand, but the existing comment
might be good enough as well. The full state machine path for demote is:

PM_DEMOTING                      /* wait for active xid backend to finish */
PM_SHUTDOWN                      /* wait for checkpoint shutdown and its
                                    various shutdown tasks */
PM_SHUTDOWN && !CheckpointerPID  /* aka PM_DEMOTED: start Startup process */
PM_STARTUP
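
As a rough model of that path, the transitions could be sketched as below.
This is a simplified toy, not the postmaster code: the Sketch* types, the SK_
prefixed states and the struct fields are invented for the example, and the
reset of the demote flag once recovery has started is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced set of postmaster states used by a demote. */
typedef enum
{
	SK_PM_RUN,
	SK_PM_DEMOTING,
	SK_PM_SHUTDOWN,
	SK_PM_STARTUP
} SketchPMState;

typedef struct
{
	SketchPMState state;
	bool		demote_signal;			/* a demote was requested */
	int			writing_backends;		/* backends still holding an xid */
	bool		checkpointer_running;	/* shutdown checkpoint not done yet */
} SketchPostmaster;

/* A demote request only makes sense while in production. */
void
sketch_request_demote(SketchPostmaster *pm)
{
	assert(pm != NULL);
	if (pm->state == SK_PM_RUN)
	{
		pm->demote_signal = true;
		pm->state = SK_PM_DEMOTING;
	}
}

/* One pass of the state machine, called whenever something changed. */
void
sketch_state_machine(SketchPostmaster *pm)
{
	assert(pm != NULL);

	/* All backends are read-only: let the checkpointer do its shutdown work. */
	if (pm->state == SK_PM_DEMOTING && pm->writing_backends == 0)
		pm->state = SK_PM_SHUTDOWN;

	/* "PM_DEMOTED": checkpointer is gone, start the Startup process. */
	if (pm->demote_signal && pm->state == SK_PM_SHUTDOWN &&
		!pm->checkpointer_running)
		pm->state = SK_PM_STARTUP;
}
```

The second test is where a dedicated PM_DEMOTED state would replace the
combined condition, at the cost of one more enum value.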

Tests in "recovery/t/021_promote-demote.pl" grow from 13 to 24 tests,
adding tests on backend behavior during demote and on the new function
pg_demote().

On my todo:

* cancel running checkpoint for fast demote ?
* forbid demote when PITR backup is in progress
* user documentation
* Robert's concern about snapshot during hot standby
* anything else reported to me

Plus, I might be able to split the backend part of patch 0002, along with its
signals, into its own patch if it helps the review. It would apply after 0001
and before the current 0002.

As there was no consensus and the discussions seemed to conclude this patch set
should keep growing to see where it goes, I wonder if/when I should add it to
the commitfest. Advice? Opinion?

Regards,

Attachments:

v4-0001-demote-setter-functions-for-LocalXLogInsert-local-va.patchtext/x-patchDownload
From da3c4575f8ea40c089483b9cfa209db4993148ff Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Fri, 31 Jul 2020 10:58:40 +0200
Subject: [PATCH 1/4] demote: setter functions for LocalXLogInsert local
 variable

Add extern functions LocalSetXLogInsertNotAllowed() and
LocalSetXLogInsertCheckRecovery() to set the local variable
LocalXLogInsertAllowed to 0 and -1, respectively.

These functions are declared as extern for future need in
the demote patch.

Function LocalSetXLogInsertAllowed() already exists and is
declared static, as it is not needed outside of xlog.c.
---
 src/backend/access/transam/xlog.c | 27 +++++++++++++++++++++++----
 src/include/access/xlog.h         |  2 ++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 756b838e6a..25a9f78690 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7711,7 +7711,7 @@ StartupXLOG(void)
 	Insert->fullPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
+	LocalSetXLogInsertCheckRecovery();
 
 	if (InRecovery)
 	{
@@ -8219,6 +8219,25 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+/*
+ * Make XLogInsertAllowed() return false in the current process only.
+ */
+void
+LocalSetXLogInsertNotAllowed(void)
+{
+	LocalXLogInsertAllowed = 0;
+}
+
+/*
+ * Make XLogInsertAllowed() check the recovery state in the current process only.
+ */
+void
+LocalSetXLogInsertCheckRecovery(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -9004,9 +9023,9 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		if (flags & CHECKPOINT_END_OF_RECOVERY)
-			LocalXLogInsertAllowed = -1;	/* return to "check" state */
+			LocalSetXLogInsertCheckRecovery(); /* return to "check" state */
 		else
-			LocalXLogInsertAllowed = 0; /* never again write WAL */
+			LocalSetXLogInsertNotAllowed(); /* never again write WAL */
 	}
 
 	/*
@@ -9159,7 +9178,7 @@ CreateEndOfRecoveryRecord(void)
 
 	END_CRIT_SECTION();
 
-	LocalXLogInsertAllowed = -1;	/* return to "check" state */
+	LocalSetXLogInsertCheckRecovery();	/* return to "check" state */
 }
 
 /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..8c9cadc6da 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,8 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void LocalSetXLogInsertNotAllowed(void);
+extern void LocalSetXLogInsertCheckRecovery(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
-- 
2.20.1
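
The three values of LocalXLogInsertAllowed that this patch wraps in setter
functions can be modelled as below. This is a simplified, standalone sketch
and not the PostgreSQL code: every sketch_* name is hypothetical, and
sketch_in_recovery only stands in for what RecoveryInProgress() would report.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the result of RecoveryInProgress(). */
bool sketch_in_recovery = false;

/*
 * Per-process tri-state flag, modelled on LocalXLogInsertAllowed:
 *   1 = WAL insertion allowed, 0 = never allowed, -1 = "check" state.
 */
int sketch_insert_allowed = -1;

/* Models XLogInsertAllowed(): trust the cached value unless it is -1. */
bool
sketch_xlog_insert_allowed(void)
{
	assert(sketch_insert_allowed >= -1 && sketch_insert_allowed <= 1);

	if (sketch_insert_allowed >= 0)
		return sketch_insert_allowed != 0;

	/* "check" state: the answer depends on the recovery state */
	if (sketch_in_recovery)
		return false;

	/* out of recovery: cache the positive answer */
	sketch_insert_allowed = 1;
	return true;
}

/* Models LocalSetXLogInsertNotAllowed(): force "never write WAL". */
void
sketch_set_not_allowed(void)
{
	sketch_insert_allowed = 0;
}

/* Models LocalSetXLogInsertCheckRecovery(): go back to the "check" state. */
void
sketch_set_check_recovery(void)
{
	sketch_insert_allowed = -1;
}
```

In this model, a process put back in the "check" state re-evaluates the
recovery state on its next call, whereas 0 blocks WAL insertion regardless of
the recovery state.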

v4-0002-demote-support-demoting-instance-from-production-to-.patch (text/x-patch)
From e25f4699deba41025e05ddf0d85755ad6dc8917e Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Fri, 10 Apr 2020 18:01:45 +0200
Subject: [PATCH 2/4] demote: support demoting instance from production to
 standby

Procedure and Architecture:

* demote can be triggered using "pg_ctl [-m {fast|smart}] demote"
  or using the SQL admin function
  "pg_demote(fast bool, wait bool, wait_seconds int)"
* on SIGUSR1, postmaster checks for demote signal files
  "demote" or "demote_fast" in PGDATA
* if in production and a demote signal file is found, the demote
  process starts by setting PM_DEMOTING
* set DB_DEMOTING state in controlfile
* PM_DEMOTING waits for all backends to be read only
* every idle backend sets LocalXLogInsertAllowed=0 immediately to
  forbid new writes
* in fast mode, every backend holding an xid is terminated
* in smart mode, wait for running xacts to finish
* once all backends are read only, set PM_SHUTDOWN to create a
  shutdown checkpoint
* during the shutdown checkpoint, ShutdownXLOG now takes a boolean
  arg to handle demote differently than a normal shutdown
* the shutdown checkpoint sets the cluster state as DB_DEMOTING in
  the controlfile
* the checkpointer then exits to have a fresh restart for code
  simplicity
* Postmaster sets PM_STARTUP on checkpointer exit status
* the startup process is then started from PostmasterStateMachine()
  and tries to handle subsystems init correctly during demote
* the demote procedure keeps some sub-processes alive:
  stat collector, bgwriter and optionally archiver and wal senders
* at the end of demote, send USR1 to signal the backends and wal
  senders to set their environment as in recovery and cascading
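
The postmaster side of the procedure above can be sketched as a small state
machine. This is an illustrative model under stated assumptions, not the code
in postmaster.c: the Sketch* types and sketch_* functions are hypothetical,
and the open-transaction count and checkpointer liveness are plain fields
instead of CountXacts() and CheckpointerPID.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of the postmaster states involved in a demote. */
typedef enum
{
	SK_PM_RUN,					/* normal production state */
	SK_PM_DEMOTING,				/* waiting for backends to be read only */
	SK_PM_SHUTDOWN,				/* demote shutdown checkpoint in progress */
	SK_PM_STARTUP				/* startup process launched as standby */
} SketchPmState;

typedef struct
{
	SketchPmState state;
	bool		demote_signal;		/* models DemoteSignal */
	int			open_xacts;			/* models the result of CountXacts() */
	bool		checkpointer_alive; /* models CheckpointerPID != 0 */
} SketchPostmaster;

/* A demote signal file was found while in production. */
void
sketch_request_demote(SketchPostmaster *pm)
{
	if (pm->state == SK_PM_RUN)
	{
		pm->demote_signal = true;
		pm->state = SK_PM_DEMOTING;
	}
}

/* One pass of the state machine, in the same order as the list above. */
void
sketch_state_machine(SketchPostmaster *pm)
{
	/* all backends read only: ask for the demote shutdown checkpoint */
	if (pm->state == SK_PM_DEMOTING && pm->open_xacts == 0)
		pm->state = SK_PM_SHUTDOWN;

	/* the checkpointer exited after the checkpoint: start recovery */
	if (pm->demote_signal &&
		pm->state == SK_PM_SHUTDOWN &&
		!pm->checkpointer_alive)
		pm->state = SK_PM_STARTUP;
}
```

The second test only fires once the checkpointer has exited, which is how the
checkpointer exit doubles as the "checkpoint done" notification in this model.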

Discuss/Todo:

* add doc
* code reviewing
* do not handle backup in progress during demote
* investigate snapshots shmem needs/init during recovery compared to
  production
* cancel running checkpoint during demote
  * replace with an END_OF_PRODUCTION xlog record?
---
 src/backend/access/transam/twophase.c   |  95 +++++++
 src/backend/access/transam/xlog.c       | 315 ++++++++++++++++--------
 src/backend/postmaster/checkpointer.c   |  28 +++
 src/backend/postmaster/pgstat.c         |   3 +
 src/backend/postmaster/postmaster.c     | 235 +++++++++++++++++-
 src/backend/storage/ipc/procarray.c     |   2 +
 src/backend/storage/ipc/procsignal.c    |  30 +++
 src/backend/storage/lmgr/lock.c         |  12 +
 src/backend/tcop/postgres.c             |  44 ++++
 src/backend/utils/init/globals.c        |   1 +
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/bin/pg_ctl/pg_ctl.c                 | 117 +++++++++
 src/include/access/twophase.h           |   1 +
 src/include/access/xlog.h               |  23 +-
 src/include/catalog/pg_control.h        |   1 +
 src/include/libpq/libpq-be.h            |   7 +-
 src/include/miscadmin.h                 |   1 +
 src/include/pgstat.h                    |   1 +
 src/include/postmaster/bgwriter.h       |   1 +
 src/include/storage/lock.h              |   2 +
 src/include/storage/procsignal.h        |   4 +
 src/include/tcop/tcopprot.h             |   2 +
 src/include/utils/pidfile.h             |   1 +
 23 files changed, 799 insertions(+), 129 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9b2e59bf0e..fda085631f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1565,6 +1565,101 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	pfree(buf);
 }
 
+/*
+ * ShutdownPreparedTransactions: clean prepared transactions from shared memory
+ *
+ * This is called during the demote process to clean the shared memory
+ * before the startup process loads everything back in correctly
+ * for the standby mode.
+ *
+ * Note: this function assumes all prepared transactions have been
+ * written to disk. Consequently, it must be called AFTER the demote
+ * shutdown checkpoint.
+ */
+void
+ShutdownPreparedTransactions(void)
+{
+	int i;
+
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact;
+		PGPROC	   *proc;
+		TransactionId xid;
+		char	   *buf;
+		char	   *bufptr;
+		TwoPhaseFileHeader *hdr;
+		TransactionId latestXid;
+		TransactionId *children;
+
+		gxact = TwoPhaseState->prepXacts[i];
+		proc = &ProcGlobal->allProcs[gxact->pgprocno];
+		xid = ProcGlobal->allPgXact[gxact->pgprocno].xid;
+
+		/* Read and validate 2PC state data */
+		Assert(gxact->ondisk);
+		buf = ReadTwoPhaseFile(xid, false);
+
+		/*
+		 * Disassemble the header area
+		 */
+		hdr = (TwoPhaseFileHeader *) buf;
+		Assert(TransactionIdEquals(hdr->xid, xid));
+		bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader))
+			   + MAXALIGN(hdr->gidlen);
+		children = (TransactionId *) bufptr;
+		bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId))
+				+ MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode))
+				+ MAXALIGN(hdr->nabortrels * sizeof(RelFileNode))
+				+ MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+
+		/* compute latestXid among all children */
+		latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children);
+
+		/* remove dummy proc associated to the gxact */
+		ProcArrayRemove(proc, latestXid);
+
+		/*
+		 * This lock is probably not needed during the demote process
+		 * as all backends are already gone.
+		 */
+		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+		/* cleanup locks */
+		for (;;)
+		{
+			TwoPhaseRecordOnDisk *record = (TwoPhaseRecordOnDisk *) bufptr;
+
+			Assert(record->rmid <= TWOPHASE_RM_MAX_ID);
+			if (record->rmid == TWOPHASE_RM_END_ID)
+				break;
+
+			bufptr += MAXALIGN(sizeof(TwoPhaseRecordOnDisk));
+
+			if (record->rmid == TWOPHASE_RM_LOCK_ID)
+				lock_twophase_shutdown(xid, record->info,
+									 (void *) bufptr, record->len);
+
+			bufptr += MAXALIGN(record->len);
+		}
+
+		/* and put it back in the freelist */
+		gxact->next = TwoPhaseState->freeGXacts;
+		TwoPhaseState->freeGXacts = gxact;
+
+		/*
+		 * Release the lock as all callbacks are called and shared memory cleanup
+		 * is done.
+		 */
+		LWLockRelease(TwoPhaseStateLock);
+
+		pfree(buf);
+	}
+
+	TwoPhaseState->numPrepXacts -= i;
+	Assert(TwoPhaseState->numPrepXacts == 0);
+}
+
 /*
  * Scan 2PC state data in memory and call the indicated callbacks for each 2PC record.
  */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 25a9f78690..4aaa138b1b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6298,6 +6298,11 @@ CheckRequiredParameterValues(void)
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
+/*
+ * FIXME demote: part of the code here assumes there are no other active
+ * processes before the PMSIGNAL_RECOVERY_STARTED signal is sent.
+ */
+
 void
 StartupXLOG(void)
 {
@@ -6321,6 +6326,7 @@ StartupXLOG(void)
 	XLogPageReadPrivate private;
 	bool		promoted = false;
 	struct stat st;
+	bool		is_demoting = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6385,6 +6391,25 @@ StartupXLOG(void)
 							str_time(ControlFile->time))));
 			break;
 
+		case DB_DEMOTING:
+			ereport(LOG,
+					(errmsg("database system was demoted at %s",
+							str_time(ControlFile->time))));
+			is_demoting = true;
+			bgwriterLaunched = true;
+			InArchiveRecovery = true;
+			StandbyMode = true;
+
+			/*
+			 * previous state was RECOVERY_STATE_DONE. We need to
+			 * reinit it to something else so RecoveryInProgress()
+			 * doesn't return false.
+			 */
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+			break;
+
 		default:
 			ereport(FATAL,
 					(errmsg("control file contains invalid database cluster state")));
@@ -6418,7 +6443,8 @@ StartupXLOG(void)
 	 *   persisted.  To avoid that, fsync the entire data directory.
 	 */
 	if (ControlFile->state != DB_SHUTDOWNED &&
-		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
+		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY &&
+		ControlFile->state != DB_DEMOTING)
 	{
 		RemoveTempXlogFiles();
 		SyncDataDirectory();
@@ -6674,7 +6700,8 @@ StartupXLOG(void)
 					(errmsg("could not locate a valid checkpoint record")));
 		}
 		memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
-		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
+		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN) &&
+			!is_demoting;
 	}
 
 	/*
@@ -6736,9 +6763,9 @@ StartupXLOG(void)
 	LastRec = RecPtr = checkPointLoc;
 
 	ereport(DEBUG1,
-			(errmsg_internal("redo record is at %X/%X; shutdown %s",
+			(errmsg_internal("redo record is at %X/%X; %s checkpoint",
 							 (uint32) (checkPoint.redo >> 32), (uint32) checkPoint.redo,
-							 wasShutdown ? "true" : "false")));
+							 wasShutdown ? "shutdown" : is_demoting ? "demote" : "")));
 	ereport(DEBUG1,
 			(errmsg_internal("next transaction ID: " UINT64_FORMAT "; next OID: %u",
 							 U64FromFullTransactionId(checkPoint.nextFullXid),
@@ -6772,47 +6799,7 @@ StartupXLOG(void)
 					 checkPoint.newestCommitTsXid);
 	XLogCtl->ckptFullXid = checkPoint.nextFullXid;
 
-	/*
-	 * Initialize replication slots, before there's a chance to remove
-	 * required resources.
-	 */
-	StartupReplicationSlots();
-
-	/*
-	 * Startup logical state, needs to be setup now so we have proper data
-	 * during crash recovery.
-	 */
-	StartupReorderBuffer();
-
-	/*
-	 * Startup MultiXact. We need to do this early to be able to replay
-	 * truncations.
-	 */
-	StartupMultiXact();
-
-	/*
-	 * Ditto for commit timestamps.  Activate the facility if the setting is
-	 * enabled in the control file, as there should be no tracking of commit
-	 * timestamps done when the setting was disabled.  This facility can be
-	 * started or stopped when replaying a XLOG_PARAMETER_CHANGE record.
-	 */
-	if (ControlFile->track_commit_timestamp)
-		StartupCommitTs();
-
-	/*
-	 * Recover knowledge about replay progress of known replication partners.
-	 */
-	StartupReplicationOrigin();
 
-	/*
-	 * Initialize unlogged LSN. On a clean shutdown, it's restored from the
-	 * control file. On recovery, all unlogged relations are blown away, so
-	 * the unlogged LSN counter can be reset too.
-	 */
-	if (ControlFile->state == DB_SHUTDOWNED)
-		XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
-	else
-		XLogCtl->unloggedLSN = FirstNormalUnloggedLSN;
 
 	/*
 	 * We must replay WAL entries using the same TimeLineID they were created
@@ -6821,19 +6808,64 @@ StartupXLOG(void)
 	 */
 	ThisTimeLineID = checkPoint.ThisTimeLineID;
 
-	/*
-	 * Copy any missing timeline history files between 'now' and the recovery
-	 * target timeline from archive to pg_wal. While we don't need those files
-	 * ourselves - the history file of the recovery target timeline covers all
-	 * the previous timelines in the history too - a cascading standby server
-	 * might be interested in them. Or, if you archive the WAL from this
-	 * server to a different archive than the primary, it'd be good for all the
-	 * history files to get archived there after failover, so that you can use
-	 * one of the old timelines as a PITR target. Timeline history files are
-	 * small, so it's better to copy them unnecessarily than not copy them and
-	 * regret later.
-	 */
-	restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI);
+	if (!is_demoting)
+	{
+		/*
+		 * Initialize replication slots, before there's a chance to remove
+		 * required resources.
+		 */
+		StartupReplicationSlots();
+
+		/*
+		 * Startup logical state, needs to be setup now so we have proper data
+		 * during crash recovery.
+		 */
+		StartupReorderBuffer();
+
+		/*
+		 * Startup MultiXact. We need to do this early to be able to replay
+		 * truncations.
+		 */
+		StartupMultiXact();
+
+		/*
+		 * Ditto for commit timestamps.  Activate the facility if the setting is
+		 * enabled in the control file, as there should be no tracking of commit
+		 * timestamps done when the setting was disabled.  This facility can be
+		 * started or stopped when replaying a XLOG_PARAMETER_CHANGE record.
+		 */
+		if (ControlFile->track_commit_timestamp)
+			StartupCommitTs();
+
+		/*
+		 * Recover knowledge about replay progress of known replication partners.
+		 */
+		StartupReplicationOrigin();
+
+		/*
+		 * Initialize unlogged LSN. On a clean shutdown, it's restored from the
+		 * control file. On recovery, all unlogged relations are blown away, so
+		 * the unlogged LSN counter can be reset too.
+		 */
+		if (ControlFile->state == DB_SHUTDOWNED)
+			XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
+		else
+			XLogCtl->unloggedLSN = FirstNormalUnloggedLSN;
+
+		/*
+		 * Copy any missing timeline history files between 'now' and the recovery
+		 * target timeline from archive to pg_wal. While we don't need those files
+		 * ourselves - the history file of the recovery target timeline covers all
+		 * the previous timelines in the history too - a cascading standby server
+		 * might be interested in them. Or, if you archive the WAL from this
+		 * server to a different archive than the primary, it'd be good for all the
+		 * history files to get archived there after failover, so that you can use
+		 * one of the old timelines as a PITR target. Timeline history files are
+		 * small, so it's better to copy them unnecessarily than not copy them and
+		 * regret later.
+		 */
+		restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI);
+	}
 
 	/*
 	 * Before running in recovery, scan pg_twophase and fill in its status to
@@ -6888,11 +6920,25 @@ StartupXLOG(void)
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
 		{
-			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+			if (is_demoting)
+			{
+				/*
+				 * Avoid concurrent access to the ControlFile data
+				 * during demotion.
+				 */
+				LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+				ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+				LWLockRelease(ControlFileLock);
+			}
+			else
+			{
+				ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
 
-			SpinLockAcquire(&XLogCtl->info_lck);
-			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
-			SpinLockRelease(&XLogCtl->info_lck);
+				/* This is already set if demoting */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+			}
 		}
 		else
 		{
@@ -6982,7 +7028,8 @@ StartupXLOG(void)
 		/*
 		 * Reset pgstat data, because it may be invalid after recovery.
 		 */
-		pgstat_reset_all();
+		if (!is_demoting)
+			pgstat_reset_all();
 
 		/*
 		 * If there was a backup label file, it's done its job and the info
@@ -7044,7 +7091,7 @@ StartupXLOG(void)
 
 			InitRecoveryTransactionEnvironment();
 
-			if (wasShutdown)
+			if (wasShutdown || is_demoting)
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
@@ -7057,6 +7104,11 @@ StartupXLOG(void)
 			 * Startup commit log and subtrans only.  MultiXact and commit
 			 * timestamp have already been started up and other SLRUs are not
 			 * maintained during recovery and need not be started yet.
+			 *
+			 * Starting up commit log is technically not needed during demote
+			 * as the in-memory data did not move. However, this is a
+			 * lightweight initialization, and it is expected for symmetry
+			 * since ShutdownCLOG() is called during ShutdownXLOG().
 			 */
 			StartupCLOG();
 			StartupSUBTRANS(oldestActiveXID);
@@ -7067,7 +7119,7 @@ StartupXLOG(void)
 			 * empty running-xacts record and use that here and now. Recover
 			 * additional standby state for prepared transactions.
 			 */
-			if (wasShutdown)
+			if (wasShutdown || is_demoting)
 			{
 				RunningTransactionsData running;
 				TransactionId latestCompletedXid;
@@ -7938,6 +7990,7 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	XLogCtl->SharedHotStandbyActive = false;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -8056,6 +8109,23 @@ CheckRecoveryConsistency(void)
 	}
 }
 
+/*
+ * Refresh LocalRecoveryInProgress from the shared recovery state
+ */
+bool
+SetLocalRecoveryInProgress(void)
+{
+	/*
+	 * use volatile pointer to make sure we make a fresh read of the
+	 * shared variable.
+	 */
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
+
+	return LocalRecoveryInProgress;
+}
+
 /*
  * Is the system still in recovery?
  *
@@ -8077,13 +8147,7 @@ RecoveryInProgress(void)
 		return false;
 	else
 	{
-		/*
-		 * use volatile pointer to make sure we make a fresh read of the
-		 * shared variable.
-		 */
-		volatile XLogCtlData *xlogctl = XLogCtl;
-
-		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
+		SetLocalRecoveryInProgress();
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -8503,6 +8567,8 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
 void
 ShutdownXLOG(int code, Datum arg)
 {
+	bool is_demoting = DatumGetBool(arg);
+
 	/*
 	 * We should have an aux process resource owner to use, and we should not
 	 * be in a transaction that's installed some other resowner.
@@ -8512,35 +8578,55 @@ ShutdownXLOG(int code, Datum arg)
 		   CurrentResourceOwner == AuxProcessResourceOwner);
 	CurrentResourceOwner = AuxProcessResourceOwner;
 
-	/* Don't be chatty in standalone mode */
-	ereport(IsPostmasterEnvironment ? LOG : NOTICE,
-			(errmsg("shutting down")));
-
-	/*
-	 * Signal walsenders to move to stopping state.
-	 */
-	WalSndInitStopping();
-
-	/*
-	 * Wait for WAL senders to be in stopping state.  This prevents commands
-	 * from writing new WAL.
-	 */
-	WalSndWaitStopping();
+	if (is_demoting)
+	{
+		/* Don't be chatty in standalone mode */
+		ereport(IsPostmasterEnvironment ? LOG : NOTICE,
+				(errmsg("demoting")));
 
-	if (RecoveryInProgress())
-		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		/*
+		 * FIXME demote: avoiding checkpoint?
+		 * A checkpoint is probably running during a demote action. If
+		 * we don't want to wait for the checkpoint during the demote,
+		 * we might need to cancel it as it will not be able to write
+		 * to the WAL after the demote.
+		 */
+		CreateCheckPoint(CHECKPOINT_IS_DEMOTE | CHECKPOINT_IMMEDIATE);
+		ShutdownPreparedTransactions();
+		LocalRecoveryInProgress = true;
+	}
 	else
 	{
+		/* Don't be chatty in standalone mode */
+		ereport(IsPostmasterEnvironment ? LOG : NOTICE,
+				(errmsg("shutting down")));
+
 		/*
-		 * If archiving is enabled, rotate the last XLOG file so that all the
-		 * remaining records are archived (postmaster wakes up the archiver
-		 * process one more time at the end of shutdown). The checkpoint
-		 * record will go to the next XLOG file and won't be archived (yet).
+		 * Signal walsenders to move to stopping state.
 		 */
-		if (XLogArchivingActive() && XLogArchiveCommandSet())
-			RequestXLogSwitch(false);
+		WalSndInitStopping();
+
+		/*
+		 * Wait for WAL senders to be in stopping state.  This prevents commands
+		 * from writing new WAL.
+		 */
+		WalSndWaitStopping();
+
+		if (RecoveryInProgress())
+			CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		else
+		{
+			/*
+			 * If archiving is enabled, rotate the last XLOG file so that all the
+			 * remaining records are archived (postmaster wakes up the archiver
+			 * process one more time at the end of shutdown). The checkpoint
+			 * record will go to the next XLOG file and won't be archived (yet).
+			 */
+			if (XLogArchivingActive() && XLogArchiveCommandSet())
+				RequestXLogSwitch(false);
 
-		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+			CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		}
 	}
 	ShutdownCLOG();
 	ShutdownCommitTs();
@@ -8554,9 +8640,10 @@ ShutdownXLOG(int code, Datum arg)
 static void
 LogCheckpointStart(int flags, bool restartpoint)
 {
-	elog(LOG, "%s starting:%s%s%s%s%s%s%s%s",
+	elog(LOG, "%s starting:%s%s%s%s%s%s%s%s%s",
 		 restartpoint ? "restartpoint" : "checkpoint",
 		 (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
+		 (flags & CHECKPOINT_IS_DEMOTE) ? " demote" : "",
 		 (flags & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "",
 		 (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
 		 (flags & CHECKPOINT_FORCE) ? " force" : "",
@@ -8692,6 +8779,7 @@ UpdateCheckPointDistanceEstimate(uint64 nbytes)
  *
  * flags is a bitwise OR of the following:
  *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
+ *	CHECKPOINT_IS_DEMOTE: checkpoint is for demote.
  *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
  *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *		ignoring checkpoint_completion_target parameter.
@@ -8720,6 +8808,7 @@ void
 CreateCheckPoint(int flags)
 {
 	bool		shutdown;
+	bool		demote;
 	CheckPoint	checkPoint;
 	XLogRecPtr	recptr;
 	XLogSegNo	_logSegNo;
@@ -8732,14 +8821,21 @@ CreateCheckPoint(int flags)
 	int			nvxids;
 
 	/*
-	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
-	 * issued at a different time.
+	 * An end-of-recovery or demote checkpoint is really a shutdown checkpoint,
+	 * just issued at a different time.
 	 */
-	if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
+	if (flags & (CHECKPOINT_IS_SHUTDOWN |
+				 CHECKPOINT_IS_DEMOTE |
+				 CHECKPOINT_END_OF_RECOVERY))
 		shutdown = true;
 	else
 		shutdown = false;
 
+	if (flags & CHECKPOINT_IS_DEMOTE)
+		demote = true;
+	else
+		demote = false;
+
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
@@ -8780,7 +8876,7 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-		ControlFile->state = DB_SHUTDOWNING;
+		ControlFile->state = demote ? DB_DEMOTING : DB_SHUTDOWNING;
 		ControlFile->time = (pg_time_t) time(NULL);
 		UpdateControlFile();
 		LWLockRelease(ControlFileLock);
@@ -8826,7 +8922,7 @@ CreateCheckPoint(int flags)
 	 * avoid inserting duplicate checkpoints when the system is idle.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
-				  CHECKPOINT_FORCE)) == 0)
+				  CHECKPOINT_IS_DEMOTE | CHECKPOINT_FORCE)) == 0)
 	{
 		if (last_important_lsn == ControlFile->checkPoint)
 		{
@@ -8994,8 +9090,8 @@ CreateCheckPoint(int flags)
 	 * allows us to reconstruct the state of running transactions during
 	 * archive recovery, if required. Skip, if this info disabled.
 	 *
-	 * If we are shutting down, or Startup process is completing crash
-	 * recovery we don't need to write running xact data.
+	 * If we are shutting down, demoting or Startup process is completing
+	 * crash recovery we don't need to write running xact data.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
@@ -9014,11 +9110,11 @@ CreateCheckPoint(int flags)
 	XLogFlush(recptr);
 
 	/*
-	 * We mustn't write any new WAL after a shutdown checkpoint, or it will be
-	 * overwritten at next startup.  No-one should even try, this just allows
-	 * sanity-checking.  In the case of an end-of-recovery checkpoint, we want
-	 * to just temporarily disable writing until the system has exited
-	 * recovery.
+	 * We mustn't write any new WAL after a shutdown or demote checkpoint, or
+	 * it will be overwritten at next startup.  No-one should even try, this
+	 * just allows sanity-checking.  In the case of an end-of-recovery
+	 * checkpoint, we want to just temporarily disable writing until the system
+	 * has exited recovery.
 	 */
 	if (shutdown)
 	{
@@ -9034,7 +9130,8 @@ CreateCheckPoint(int flags)
 	 */
 	if (shutdown && checkPoint.redo != ProcLastRecPtr)
 		ereport(PANIC,
-				(errmsg("concurrent write-ahead log activity while database system is shutting down")));
+				(errmsg("concurrent write-ahead log activity while database system is %s",
+						demote ? "demoting" : "shutting down")));
 
 	/*
 	 * Remember the prior checkpoint's redo ptr for
@@ -9047,7 +9144,7 @@ CreateCheckPoint(int flags)
 	 */
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	if (shutdown)
-		ControlFile->state = DB_SHUTDOWNED;
+		ControlFile->state = demote ? DB_DEMOTING : DB_SHUTDOWNED;
 	ControlFile->checkPoint = ProcLastRecPtr;
 	ControlFile->checkPointCopy = checkPoint;
 	ControlFile->time = (pg_time_t) time(NULL);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 624a3238b8..58473a61fd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -151,6 +151,7 @@ double		CheckPointCompletionTarget = 0.5;
  * Private state
  */
 static bool ckpt_active = false;
+static volatile sig_atomic_t demoteRequestPending = false;
 
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
@@ -552,6 +553,21 @@ HandleCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
+	if (demoteRequestPending)
+	{
+		demoteRequestPending = false;
+		/* Close down the database */
+		ShutdownXLOG(0, BoolGetDatum(true));
+		/*
+		 * Exit checkpointer. We could keep it around during demotion, but
+		 * exiting here has multiple benefits:
+		 * - to create a fresh process with clean local vars
+		 *   (eg. LocalRecoveryInProgress)
+		 * - to signal postmaster the demote shutdown checkpoint is done
+		 *   and keep going with next steps of the demotion
+		 */
+		proc_exit(0);
+	}
 	if (ShutdownRequestPending)
 	{
 		/*
@@ -680,6 +696,7 @@ CheckpointWriteDelay(int flags, double progress)
 	 * in which case we just try to catch up as quickly as possible.
 	 */
 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!demoteRequestPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
 		IsCheckpointOnSchedule(progress))
@@ -812,6 +829,17 @@ IsCheckpointOnSchedule(double progress)
  * --------------------------------
  */
 
+/* SIGUSR1: set flag to demote */
+void
+ReqCheckpointDemoteHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	demoteRequestPending = true;
+
+	errno = save_errno;
+}
+
 /* SIGINT: set flag to run a normal checkpoint right away */
 static void
 ReqCheckpointHandler(SIGNAL_ARGS)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 15f92b66c6..d20b8d8530 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3854,6 +3854,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_PROMOTE:
 			event_name = "Promote";
 			break;
+		case WAIT_EVENT_DEMOTE:
+			event_name = "Demote";
+			break;
 		case WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT:
 			event_name = "RecoveryConflictSnapshot";
 			break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5b5fc97c72..8004770d8c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -275,12 +275,13 @@ static StartupStatusEnum StartupStatus = STARTUP_NOT_RUNNING;
 #define			ImmediateShutdown	3
 
 static int	Shutdown = NoShutdown;
+static bool DemoteSignal = false; /* true on demote request */
 
 static bool FatalError = false; /* T if recovering from backend crash */
 
 /*
- * We use a simple state machine to control startup, shutdown, and
- * crash recovery (which is rather like shutdown followed by startup).
+ * We use a simple state machine to control startup, shutdown, demote and
+ * crash recovery (both are rather like shutdown followed by startup).
  *
  * After doing all the postmaster initialization work, we enter PM_STARTUP
  * state and the startup process is launched. The startup process begins by
@@ -324,6 +325,7 @@ typedef enum
 {
 	PM_INIT,					/* postmaster starting */
 	PM_STARTUP,					/* waiting for startup subprocess */
+	PM_DEMOTING,				/* waiting for idle or RO backends for demote */
 	PM_RECOVERY,				/* in archive recovery mode */
 	PM_HOT_STANDBY,				/* in hot standby mode */
 	PM_RUN,						/* normal "database is alive" state */
@@ -414,10 +416,14 @@ static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
 static bool SignalSomeChildren(int signal, int targets);
 static void TerminateChildren(int signal);
+static void RemoveDemoteSignalFiles(void);
+static bool CheckDemoteSignal(void);
+
 
 #define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
 
 static int	CountChildren(int target);
+static int	CountXacts(void);
 static bool assign_backendlist_entry(RegisteredBgWorker *rw);
 static void maybe_start_bgworkers(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
@@ -2305,6 +2311,11 @@ retry1:
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
 					 errmsg("the database system is starting up")));
 			break;
+		case CAC_DEMOTE:
+			ereport(FATAL,
+					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+					 errmsg("the database system is demoting")));
+			break;
 		case CAC_SHUTDOWN:
 			ereport(FATAL,
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
@@ -2436,10 +2447,11 @@ canAcceptConnections(int backend_type)
 	CAC_state	result = CAC_OK;
 
 	/*
-	 * Can't start backends when in startup/shutdown/inconsistent recovery
-	 * state.  We treat autovac workers the same as user backends for this
-	 * purpose.  However, bgworkers are excluded from this test; we expect
-	 * bgworker_should_start_now() decided whether the DB state allows them.
+	 * Can't start backends when in startup/demote/shutdown/inconsistent
+	 * recovery state.  We treat autovac workers the same as user backends
+	 * for this purpose.  However, bgworkers are excluded from this test;
+	 * we expect bgworker_should_start_now() decided whether the DB state
+	 * allows them.
 	 *
 	 * In state PM_WAIT_BACKUP only superusers can connect (this must be
 	 * allowed so that a superuser can end online backup mode); we return
@@ -2452,6 +2464,8 @@ canAcceptConnections(int backend_type)
 	{
 		if (pmState == PM_WAIT_BACKUP)
 			result = CAC_WAITBACKUP;	/* allow superusers only */
+		else if (DemoteSignal)
+			return CAC_DEMOTE;	/* demote is pending */
 		else if (Shutdown > NoShutdown)
 			return CAC_SHUTDOWN;	/* shutdown is pending */
 		else if (!FatalError &&
@@ -3108,7 +3122,18 @@ reaper(SIGNAL_ARGS)
 		if (pid == CheckpointerPID)
 		{
 			CheckpointerPID = 0;
-			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
+			if (EXIT_STATUS_0(exitstatus) &&
+					 DemoteSignal &&
+					 pmState == PM_SHUTDOWN)
+			{
+				/*
+				 * The checkpointer exit signals the demote shutdown checkpoint
+				 * is done. The startup recovery mode can be started from there.
+				 */
+				ereport(DEBUG1,
+						(errmsg_internal("checkpointer shutdown for demote")));
+			}
+			else if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
 			{
 				/*
 				 * OK, we saw normal exit of the checkpointer after it's been
@@ -3802,6 +3827,25 @@ PostmasterStateMachine(void)
 			pmState = PM_WAIT_BACKENDS;
 	}
 
+	if (pmState == PM_DEMOTING)
+	{
+		int numXacts = CountXacts();
+
+		/*
+		 * PM_DEMOTING state ends when we have no active transactions
+		 * and all backends set LocalXLogInsertAllowed=0
+		 */
+		if (numXacts == 0)
+		{
+			ereport(LOG, (errmsg("all backends in read only")));
+
+			SendProcSignal(CheckpointerPID, PROCSIG_CHECKPOINTER_DEMOTING, InvalidBackendId);
+			pmState = PM_SHUTDOWN;
+		}
+		else
+			ereport(LOG, (errmsg("waiting for %d transactions to finish", numXacts)));
+	}
+
 	if (pmState == PM_WAIT_READONLY)
 	{
 		/*
@@ -3995,6 +4039,20 @@ PostmasterStateMachine(void)
 		(StartupStatus == STARTUP_CRASHED || !restart_after_crash))
 		ExitPostmaster(1);
 
+
+	/* Demoting: start the Startup Process */
+	if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0)
+	{
+		/* stop archiver process if not required during standby */
+		if (!XLogArchivingAlways() && PgArchPID != 0)
+			signal_child(PgArchPID, SIGQUIT);
+
+		StartupPID = StartupDataBase();
+		Assert(StartupPID != 0);
+		StartupStatus = STARTUP_RUNNING;
+		pmState = PM_STARTUP;
+	}
+
 	/*
 	 * If we need to recover from a crash, wait for all non-syslogger children
 	 * to exit, then reset shmem and StartupDataBase.
@@ -5205,8 +5263,12 @@ sigusr1_handler(SIGNAL_ARGS)
 		 * Crank up the background tasks.  It doesn't matter if this fails,
 		 * we'll just try again later.
 		 */
+		if (!DemoteSignal)
+			Assert(PgArchPID == 0);
+
 		Assert(CheckpointerPID == 0);
 		CheckpointerPID = StartCheckpointer();
+
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
 
@@ -5214,8 +5276,7 @@ sigusr1_handler(SIGNAL_ARGS)
 		 * Start the archiver if we're responsible for (re-)archiving received
 		 * files.
 		 */
-		Assert(PgArchPID == 0);
-		if (XLogArchivingAlways())
+		if (PgArchPID == 0 && XLogArchivingAlways())
 			PgArchPID = pgarch_start();
 
 		/*
@@ -5226,6 +5287,7 @@ sigusr1_handler(SIGNAL_ARGS)
 		if (!EnableHotStandby)
 		{
 			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STANDBY);
+			DemoteSignal = false;
 #ifdef USE_SYSTEMD
 			sd_notify(0, "READY=1");
 #endif
@@ -5236,11 +5298,15 @@ sigusr1_handler(SIGNAL_ARGS)
 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
 		pmState == PM_RECOVERY && Shutdown == NoShutdown)
 	{
+		dlist_iter	iter;
+
 		/*
 		 * Likewise, start other special children as needed.
 		 */
-		Assert(PgStatPID == 0);
-		PgStatPID = pgstat_start();
+		if (!DemoteSignal)
+			Assert(PgStatPID == 0);
+		if (PgStatPID == 0)
+			PgStatPID = pgstat_start();
 
 		ereport(LOG,
 				(errmsg("database system is ready to accept read only connections")));
@@ -5251,7 +5317,17 @@ sigusr1_handler(SIGNAL_ARGS)
 		sd_notify(0, "READY=1");
 #endif
 
+		if (DemoteSignal)
+			dlist_foreach(iter, &BackendList)
+			{
+				Backend    *bp = dlist_container(Backend, elem, iter.cur);
+
+				if (!bp->dead_end && bp->bkend_type & (BACKEND_TYPE_NORMAL|BACKEND_TYPE_WALSND))
+					SendProcSignal(bp->pid, PROCSIG_DEMOTED, InvalidBackendId);
+			}
+
 		pmState = PM_HOT_STANDBY;
+		DemoteSignal = false;
 		/* Some workers may be scheduled to start now */
 		StartWorkerNeeded = true;
 	}
@@ -5342,6 +5418,97 @@ sigusr1_handler(SIGNAL_ARGS)
 		signal_child(StartupPID, SIGUSR2);
 	}
 
+	if (CheckDemoteSignal() && pmState != PM_RUN)
+	{
+		DemoteSignal = false;
+		RemoveDemoteSignalFiles();
+		ereport(LOG,
+				(errmsg("ignoring demote signal because already in standby mode")));
+	}
+	/* received demote signal */
+	else if (CheckDemoteSignal())
+	{
+		FILE	   *standby_file;
+		dlist_iter	iter;
+		bool fast_demote;
+		struct stat stat_buf;
+
+		fast_demote = (stat(DEMOTE_FAST_SIGNAL_FILE, &stat_buf) == 0);
+
+		DemoteSignal = true;
+		RemoveDemoteSignalFiles();
+
+		/* create the standby signal file */
+		standby_file = AllocateFile(STANDBY_SIGNAL_FILE, "w");
+		if (!standby_file)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create file \"%s\": %m",
+							STANDBY_SIGNAL_FILE)));
+			goto out;
+		}
+
+		if (FreeFile(standby_file))
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write file \"%s\": %m",
+							STANDBY_SIGNAL_FILE)));
+			unlink(STANDBY_SIGNAL_FILE);
+			goto out;
+		}
+
+		if (!fast_demote)
+		{
+			/* smart demote */
+			ereport(LOG, (errmsg("received smart demote request")));
+
+		}
+		else
+		{
+			/* fast demote */
+			ereport(LOG, (errmsg("received fast demote request")));
+		}
+
+		SignalSomeChildren(SIGTERM,
+						   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
+
+		/* and the autovac launcher too */
+		if (AutoVacPID != 0)
+			signal_child(AutoVacPID, SIGTERM);
+		/* and the bgwriter too */
+		if (BgWriterPID != 0)
+			signal_child(BgWriterPID, SIGTERM);
+		/* and the walwriter too */
+		if (WalWriterPID != 0)
+			signal_child(WalWriterPID, SIGTERM);
+
+		dlist_foreach(iter, &BackendList)
+		{
+			Backend    *bp = dlist_container(Backend, elem, iter.cur);
+
+			if (bp->dead_end)
+				continue;
+			/*
+			 * Signal regular backends only: skip any backend that has
+			 * recently announced itself as a WAL sender.
+			 */
+			if (bp->bkend_type == BACKEND_TYPE_NORMAL &&
+				!IsPostmasterChildWalSender(bp->child_slot))
+				SendProcSignal(bp->pid,
+							   (fast_demote ? PROCSIG_DEMOTING_FAST : PROCSIG_DEMOTING),
+							   InvalidBackendId);
+		}
+
+		pmState = PM_DEMOTING;
+
+		/* Report status */
+		AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_DEMOTING);
+	}
+
+out:
+
 #ifdef WIN32
 	PG_SETMASK(&UnBlockSig);
 #endif
@@ -5439,6 +5606,26 @@ CountChildren(int target)
 }
 
 
+/*
+ * Count up the number of active transactions
+ */
+static int
+CountXacts(void)
+{
+	int			i;
+	int			cnt = 0;
+
+	for (i = 0; i < ProcGlobal->allProcCount; ++i)
+	{
+		PGXACT   *xact = &ProcGlobal->allPgXact[i];
+		if (TransactionIdIsValid(xact->xid))
+			cnt++;
+	}
+
+	return cnt;
+}
+
+
 /*
  * StartChildProcess -- start an auxiliary process for the postmaster
  *
@@ -5904,6 +6091,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 		case PM_WAIT_BACKENDS:
 		case PM_WAIT_READONLY:
 		case PM_WAIT_BACKUP:
+		case PM_DEMOTING:
 			break;
 
 		case PM_RUN:
@@ -6652,3 +6840,28 @@ InitPostmasterDeathWatchHandle(void)
 								 GetLastError())));
 #endif							/* WIN32 */
 }
+
+/*
+ * Remove the files signaling a demote request.
+ */
+static void
+RemoveDemoteSignalFiles(void)
+{
+	unlink(DEMOTE_SIGNAL_FILE);
+	unlink(DEMOTE_FAST_SIGNAL_FILE);
+}
+
+/*
+ * Check if a demote request appeared.
+ */
+static bool
+CheckDemoteSignal(void)
+{
+	struct stat stat_buf;
+
+	if (stat(DEMOTE_SIGNAL_FILE, &stat_buf) == 0 ||
+		stat(DEMOTE_FAST_SIGNAL_FILE, &stat_buf) == 0)
+		return true;
+
+	return false;
+}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index b448533564..0ccc32f4ce 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -191,6 +191,8 @@ ProcArrayShmemSize(void)
 	size = add_size(size, mul_size(sizeof(int), PROCARRAY_MAXPROCS));
 
 	/*
+	 * TODO demote: check safe hotStandby related init and snapshot mech.
+	 *
 	 * During Hot Standby processing we have a data structure called
 	 * KnownAssignedXids, created in shared memory. Local data structures are
 	 * also created in various backends during GetSnapshotData(),
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4fa385b0ec..ac14c662d3 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -28,6 +28,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "postmaster/bgwriter.h"
 
 /*
  * The SIGUSR1 signal is multiplexed to support signaling multiple event
@@ -585,6 +586,35 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	/* signal the checkpointer to start the demote procedure */
+	if (CheckProcSignal(PROCSIG_CHECKPOINTER_DEMOTING))
+		ReqCheckpointDemoteHandler(PROCSIG_CHECKPOINTER_DEMOTING);
+
+	/*
+	 * Ask backends to become read only by setting
+	 * LocalXLogInsertAllowed = 0 as soon as their active xact
+	 * has finished.
+	 */
+	if (CheckProcSignal(PROCSIG_DEMOTING))
+		ReqDemoteHandler(PROCSIG_DEMOTING);
+
+	/*
+	 * Ask backends to become read only by setting
+	 * LocalXLogInsertAllowed = 0 if they are idle, or to
+	 * interrupt their current xact and terminate.
+	 */
+	if (CheckProcSignal(PROCSIG_DEMOTING_FAST))
+		ReqDemoteHandler(PROCSIG_DEMOTING_FAST);
+
+	/*
+	 * Demote complete. Ask backends to rely on the
+	 * recovery status for LocalXLogInsertAllowed by
+	 * setting it to -1.
+	 * WAL senders set am_cascading_walsender.
+	 */
+	if (CheckProcSignal(PROCSIG_DEMOTED))
+		ReqDemotedHandler(PROCSIG_DEMOTED);
+
 	SetLatch(MyLatch);
 
 	latch_sigusr1_handler();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 95989ce79b..52f85cd1b3 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4371,6 +4371,18 @@ lock_twophase_postabort(TransactionId xid, uint16 info,
 	lock_twophase_postcommit(xid, info, recdata, len);
 }
 
+/*
+ * 2PC shutdown from lock table.
+ *
+ * This is actually just the same as the COMMIT case.
+ */
+void
+lock_twophase_shutdown(TransactionId xid, uint16 info,
+						void *recdata, uint32 len)
+{
+	lock_twophase_postcommit(xid, info, recdata, len);
+}
+
 /*
  *		VirtualXactLockTableInsert
  *
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index c9424f167c..6bd1e1e1d0 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -67,6 +67,7 @@
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinval.h"
@@ -3211,6 +3212,42 @@ ProcessInterrupts(void)
 		HandleParallelMessages();
 }
 
+/* SIGUSR1: set flag to demote */
+void
+ReqDemoteHandler(ProcSignalReason reason)
+{
+	if (MyBackendType != B_BACKEND)
+		return;
+
+	if (TransactionIdIsValid(MyPgXact->xid))
+	{
+		if (reason == PROCSIG_DEMOTING_FAST)
+		{
+			InterruptPending = true;
+			ProcDiePending = true;
+			SetLatch(MyLatch);
+		}
+		else
+			DemotePending = true;
+	}
+	else
+		LocalSetXLogInsertNotAllowed();
+}
+
+/* SIGUSR1: reset LocalRecoveryInProgress */
+void
+ReqDemotedHandler(ProcSignalReason reason)
+{
+	ereport(LOG,
+				(errmsg("received demote complete signal")));
+
+	SetLocalRecoveryInProgress();
+	LocalSetXLogInsertCheckRecovery();
+
+	if (MyBackendType == B_WAL_SENDER)
+		am_cascading_walsender = true;
+}
+
 
 /*
  * IA64-specific code to fetch the AR.BSP register for stack depth checks.
@@ -4224,6 +4261,12 @@ PostgresMain(int argc, char *argv[],
 				/* Send out notify signals and transmit self-notifies */
 				ProcessCompletedNotifies();
 
+				if (DemotePending) {
+					LocalSetXLogInsertNotAllowed();
+					DemotePending = false;
+					SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
+				}
+
 				/*
 				 * Also process incoming notifies, if any.  This is mostly to
 				 * ensure stable behavior in tests: if any notifies were
@@ -4285,6 +4328,7 @@ PostgresMain(int argc, char *argv[],
 		{
 			ConfigReloadPending = false;
 			ProcessConfigFile(PGC_SIGHUP);
+			SetLocalRecoveryInProgress();
 		}
 
 		/*
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..021f6af434 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t DemotePending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df74..c144cc35d3 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -57,6 +57,8 @@ dbState(DBState state)
 			return _("shut down");
 		case DB_SHUTDOWNED_IN_RECOVERY:
 			return _("shut down in recovery");
+		case DB_DEMOTING:
+			return _("demoting");
 		case DB_SHUTDOWNING:
 			return _("shutting down");
 		case DB_IN_CRASH_RECOVERY:
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 1cdc3ebaa3..a7805bd219 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -62,6 +62,7 @@ typedef enum
 	RESTART_COMMAND,
 	RELOAD_COMMAND,
 	STATUS_COMMAND,
+	DEMOTE_COMMAND,
 	PROMOTE_COMMAND,
 	LOGROTATE_COMMAND,
 	KILL_COMMAND,
@@ -103,6 +104,7 @@ static char version_file[MAXPGPATH];
 static char pid_file[MAXPGPATH];
 static char backup_file[MAXPGPATH];
 static char promote_file[MAXPGPATH];
+static char demote_file[MAXPGPATH];
 static char logrotate_file[MAXPGPATH];
 
 static volatile pgpid_t postmasterPID = -1;
@@ -129,6 +131,7 @@ static void do_stop(void);
 static void do_restart(void);
 static void do_reload(void);
 static void do_status(void);
+static void do_demote(void);
 static void do_promote(void);
 static void do_logrotate(void);
 static void do_kill(pgpid_t pid);
@@ -1029,6 +1032,115 @@ do_stop(void)
 }
 
 
+static void
+do_demote(void)
+{
+	int			cnt;
+	FILE	   *dmtfile;
+	pgpid_t		pid;
+	struct stat statbuf;
+
+	pid = get_pgpid(false);
+
+	if (pid == 0)				/* no pid file */
+	{
+		write_stderr(_("%s: PID file \"%s\" does not exist\n"), progname, pid_file);
+		write_stderr(_("Is server running?\n"));
+		exit(1);
+	}
+	else if (pid < 0)			/* standalone backend, not postmaster */
+	{
+		pid = -pid;
+		write_stderr(_("%s: cannot demote server; "
+					   "single-user server is running (PID: %ld)\n"),
+					 progname, pid);
+		exit(1);
+	}
+
+	if (shutdown_mode == IMMEDIATE_MODE)
+	{
+		write_stderr(_("%s: cannot demote server using immediate mode\n"),
+					 progname);
+		exit(1);
+	}
+	else if (shutdown_mode == FAST_MODE)
+		snprintf(demote_file, MAXPGPATH, "%s/demote_fast", pg_data);
+	else
+		snprintf(demote_file, MAXPGPATH, "%s/demote", pg_data);
+
+	if ((dmtfile = fopen(demote_file, "w")) == NULL)
+	{
+		write_stderr(_("%s: could not create demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+
+	if (fclose(dmtfile))
+	{
+		write_stderr(_("%s: could not write demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+
+	sig = SIGUSR1;
+	if (kill((pid_t) pid, sig) != 0)
+	{
+		write_stderr(_("%s: could not send demote signal (PID: %ld): %s\n"), progname, pid,
+					 strerror(errno));
+		exit(1);
+	}
+
+	if (!do_wait)
+	{
+		print_msg(_("server demoting\n"));
+		return;
+	}
+	else
+	{
+		/*
+		 * FIXME demote
+		 * If backup_label exists, an online backup is running. Warn the user
+		 * that smart demote will wait for it to finish. However, if the
+		 * server is in archive recovery, we're recovering from an online
+		 * backup instead of performing one.
+		 */
+		if (shutdown_mode == SMART_MODE &&
+			stat(backup_file, &statbuf) == 0 &&
+			get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		{
+			print_msg(_("WARNING: online backup mode is active\n"
+						"Demote will not complete until pg_stop_backup() is called.\n\n"));
+		}
+
+		print_msg(_("waiting for server to demote..."));
+
+		for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
+		{
+			if (get_control_dbstate() == DB_IN_ARCHIVE_RECOVERY)
+				break;
+
+			if (cnt % WAITS_PER_SEC == 0)
+				print_msg(".");
+			pg_usleep(USEC_PER_SEC / WAITS_PER_SEC);
+		}
+
+		if (get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		{
+			print_msg(_(" failed\n"));
+
+			write_stderr(_("%s: server does not demote\n"), progname);
+			if (shutdown_mode == SMART_MODE)
+				write_stderr(_("HINT: The \"-m fast\" option immediately disconnects sessions rather than\n"
+							   "waiting for session-initiated disconnection.\n"));
+			exit(1);
+		}
+		print_msg(_(" done\n"));
+
+		print_msg(_("server demoted\n"));
+	}
+}
+
+
 /*
  *	restart/reload routines
  */
@@ -2447,6 +2559,8 @@ main(int argc, char **argv)
 				ctl_command = RELOAD_COMMAND;
 			else if (strcmp(argv[optind], "status") == 0)
 				ctl_command = STATUS_COMMAND;
+			else if (strcmp(argv[optind], "demote") == 0)
+				ctl_command = DEMOTE_COMMAND;
 			else if (strcmp(argv[optind], "promote") == 0)
 				ctl_command = PROMOTE_COMMAND;
 			else if (strcmp(argv[optind], "logrotate") == 0)
@@ -2554,6 +2668,9 @@ main(int argc, char **argv)
 		case RELOAD_COMMAND:
 			do_reload();
 			break;
+		case DEMOTE_COMMAND:
+			do_demote();
+			break;
 		case PROMOTE_COMMAND:
 			do_promote();
 			break;
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3445..4b56f92181 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -53,6 +53,7 @@ extern void RecoverPreparedTransactions(void);
 extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
 
 extern void FinishPreparedTransaction(const char *gid, bool isCommit);
+extern void ShutdownPreparedTransactions(void);
 
 extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8c9cadc6da..b1b1ea67f9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -219,18 +219,20 @@ extern bool XLOG_DEBUG;
 
 /* These directly affect the behavior of CreateCheckPoint and subsidiaries */
 #define CHECKPOINT_IS_SHUTDOWN	0x0001	/* Checkpoint is for shutdown */
-#define CHECKPOINT_END_OF_RECOVERY	0x0002	/* Like shutdown checkpoint, but
+#define CHECKPOINT_IS_DEMOTE	0x0002	/* Like shutdown checkpoint, but
+											 * issued at end of WAL production */
+#define CHECKPOINT_END_OF_RECOVERY	0x0004	/* Like shutdown checkpoint, but
 											 * issued at end of WAL recovery */
-#define CHECKPOINT_IMMEDIATE	0x0004	/* Do it without delays */
-#define CHECKPOINT_FORCE		0x0008	/* Force even if no activity */
-#define CHECKPOINT_FLUSH_ALL	0x0010	/* Flush all pages, including those
+#define CHECKPOINT_IMMEDIATE	0x0008	/* Do it without delays */
+#define CHECKPOINT_FORCE		0x0010	/* Force even if no activity */
+#define CHECKPOINT_FLUSH_ALL	0x0020	/* Flush all pages, including those
 										 * belonging to unlogged tables */
 /* These are important to RequestCheckpoint */
-#define CHECKPOINT_WAIT			0x0020	/* Wait for completion */
-#define CHECKPOINT_REQUESTED	0x0040	/* Checkpoint request has been made */
+#define CHECKPOINT_WAIT			0x0040	/* Wait for completion */
+#define CHECKPOINT_REQUESTED	0x0080	/* Checkpoint request has been made */
 /* These indicate the cause of a checkpoint request */
-#define CHECKPOINT_CAUSE_XLOG	0x0080	/* XLOG consumption */
-#define CHECKPOINT_CAUSE_TIME	0x0100	/* Elapsed time */
+#define CHECKPOINT_CAUSE_XLOG	0x0100	/* XLOG consumption */
+#define CHECKPOINT_CAUSE_TIME	0x0200	/* Elapsed time */
 
 /*
  * Flag bits for the record being inserted, set using XLogSetRecordFlags().
@@ -301,6 +303,7 @@ extern const char *xlog_identify(uint8 info);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
+extern bool SetLocalRecoveryInProgress(void);
 extern bool RecoveryInProgress(void);
 extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
@@ -397,4 +400,8 @@ extern SessionBackupState get_backup_status(void);
 /* files to signal promotion to primary */
 #define PROMOTE_SIGNAL_FILE		"promote"
 
+/* files to signal demotion to standby */
+#define DEMOTE_SIGNAL_FILE		"demote"
+#define DEMOTE_FAST_SIGNAL_FILE	"demote_fast"
+
 #endif							/* XLOG_H */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..f529f8c7bd 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -87,6 +87,7 @@ typedef enum DBState
 	DB_STARTUP = 0,
 	DB_SHUTDOWNED,
 	DB_SHUTDOWNED_IN_RECOVERY,
+	DB_DEMOTING,
 	DB_SHUTDOWNING,
 	DB_IN_CRASH_RECOVERY,
 	DB_IN_ARCHIVE_RECOVERY,
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 179ebaa104..a9e27f009e 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -70,7 +70,12 @@ typedef struct
 
 typedef enum CAC_state
 {
-	CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
+	CAC_OK,
+	CAC_STARTUP,
+	CAC_DEMOTE,
+	CAC_SHUTDOWN,
+	CAC_RECOVERY,
+	CAC_TOOMANY,
 	CAC_WAITBACKUP
 } CAC_state;
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..d60804208f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,7 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t DemotePending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..f1c0a37e76 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -880,6 +880,7 @@ typedef enum
 	WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
 	WAIT_EVENT_PROC_SIGNAL_BARRIER,
 	WAIT_EVENT_PROMOTE,
+	WAIT_EVENT_DEMOTE,
 	WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
 	WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE,
 	WAIT_EVENT_RECOVERY_PAUSE,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e..4d4f0ea1dd 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -41,5 +41,6 @@ extern Size CheckpointerShmemSize(void);
 extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
+extern void ReqCheckpointDemoteHandler(SIGNAL_ARGS);
 
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index fdabf42721..d3b08163a2 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void lock_twophase_postcommit(TransactionId xid, uint16 info,
 									 void *recdata, uint32 len);
 extern void lock_twophase_postabort(TransactionId xid, uint16 info,
 									void *recdata, uint32 len);
+extern void lock_twophase_shutdown(TransactionId xid, uint16 info,
+									void *recdata, uint32 len);
 extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
 										  void *recdata, uint32 len);
 
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f3..7264e9a705 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -34,6 +34,10 @@ typedef enum
 	PROCSIG_PARALLEL_MESSAGE,	/* message from cooperating parallel backend */
 	PROCSIG_WALSND_INIT_STOPPING,	/* ask walsenders to prepare for shutdown  */
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
+	PROCSIG_DEMOTING,			/* ask backends to demote in smart mode */
+	PROCSIG_DEMOTING_FAST,		/* ask backends to demote in fast mode */
+	PROCSIG_DEMOTED,			/* ask backends to switch to recovery mode */
+	PROCSIG_CHECKPOINTER_DEMOTING,	/* ask checkpointer to demote */
 
 	/* Recovery conflict reasons */
 	PROCSIG_RECOVERY_CONFLICT_DATABASE,
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index bd30607b07..e5f42f9fec 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -68,6 +68,8 @@ extern void StatementCancelHandler(SIGNAL_ARGS);
 extern void FloatExceptionHandler(SIGNAL_ARGS) pg_attribute_noreturn();
 extern void RecoveryConflictInterrupt(ProcSignalReason reason); /* called from SIGUSR1
 																 * handler */
+extern void ReqDemoteHandler(ProcSignalReason reason); /* called from SIGUSR1 handler */
+extern void ReqDemotedHandler(ProcSignalReason reason); /* called from SIGUSR1 handler */
 extern void ProcessClientReadInterrupt(bool blocked);
 extern void ProcessClientWriteInterrupt(bool blocked);
 
diff --git a/src/include/utils/pidfile.h b/src/include/utils/pidfile.h
index 63fefe5c4c..f761d2c4ef 100644
--- a/src/include/utils/pidfile.h
+++ b/src/include/utils/pidfile.h
@@ -50,6 +50,7 @@
  */
 #define PM_STATUS_STARTING		"starting"	/* still starting up */
 #define PM_STATUS_STOPPING		"stopping"	/* in shutdown sequence */
+#define PM_STATUS_DEMOTING		"demoting"	/* demote sequence */
 #define PM_STATUS_READY			"ready   "	/* ready for connections */
 #define PM_STATUS_STANDBY		"standby "	/* up, won't accept connections */
 
-- 
2.20.1
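
For readers following the signal-file handshake in the patch above (pg_ctl creates "demote" or "demote_fast" in the data directory and sends SIGUSR1; the postmaster checks for the files with stat() and removes them), here is a minimal standalone sketch of that mechanism. It is not the patch's code: request_demote(), consume_demote_request() and the DemoteRequest enum are hypothetical names used for illustration only, and paths are relative to the current directory rather than to the data directory.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* File names as defined by the patch in xlog.h */
#define DEMOTE_SIGNAL_FILE		"demote"
#define DEMOTE_FAST_SIGNAL_FILE	"demote_fast"

/* Hypothetical type for this sketch; the patch uses a plain bool */
typedef enum
{
	DEMOTE_NONE,
	DEMOTE_SMART,
	DEMOTE_FAST
} DemoteRequest;

/* Requester side: create an empty signal file. Returns true on success. */
static bool
request_demote(bool fast)
{
	const char *path = fast ? DEMOTE_FAST_SIGNAL_FILE : DEMOTE_SIGNAL_FILE;
	FILE	   *f = fopen(path, "w");

	if (f == NULL)
		return false;
	return fclose(f) == 0;
}

/* Receiver side: report which request is pending, then clear it. */
static DemoteRequest
consume_demote_request(void)
{
	struct stat st;
	DemoteRequest req = DEMOTE_NONE;

	/* the fast file wins if both files exist */
	if (stat(DEMOTE_FAST_SIGNAL_FILE, &st) == 0)
		req = DEMOTE_FAST;
	else if (stat(DEMOTE_SIGNAL_FILE, &st) == 0)
		req = DEMOTE_SMART;

	/* remove both files so that a request is handled only once */
	unlink(DEMOTE_SIGNAL_FILE);
	unlink(DEMOTE_FAST_SIGNAL_FILE);

	return req;
}
```

As in the patch, the fast file takes precedence when both files are present, and both are removed whatever the outcome. The sketch leaves out the SIGUSR1 notification itself.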

v4-0003-demote-add-pg_demote-function.patch (text/x-patch)
From 673494349f497af71978985531a1fd44b8fc71c0 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Fri, 31 Jul 2020 18:07:38 +0200
Subject: [PATCH 3/4] demote: add pg_demote() function

---
 src/backend/access/transam/xlogfuncs.c | 94 ++++++++++++++++++++++++++
 src/backend/catalog/system_views.sql   |  6 ++
 src/include/catalog/pg_proc.dat        |  4 ++
 3 files changed, 104 insertions(+)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 290658b22c..733f465d38 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -784,3 +784,97 @@ pg_promote(PG_FUNCTION_ARGS)
 			(errmsg("server did not promote within %d seconds", wait_seconds)));
 	PG_RETURN_BOOL(false);
 }
+
+/*
+ * Demotes a production server.
+ *
+ * A result of "true" means that demotion has been completed if "wait" is
+ * "true", or initiated if "wait" is false.
+ */
+Datum
+pg_demote(PG_FUNCTION_ARGS)
+{
+	bool		fast = PG_GETARG_BOOL(0);
+	bool		wait = PG_GETARG_BOOL(1);
+	int			wait_seconds = PG_GETARG_INT32(2);
+	char		demote_filename[] = "demote_fast";
+	FILE	   *demote_file;
+	int			i;
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery in progress"),
+				 errhint("You cannot demote a server that is already in recovery.")));
+
+	if (!EnableHotStandby)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("function pg_demote() requires hot_standby parameter to be enabled"),
+				 errhint("The function cannot return its status from a standby where hot_standby is disabled.")));
+
+	if (wait_seconds <= 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"wait_seconds\" must not be negative or zero")));
+
+	if (!fast)
+		demote_filename[6] = '\0';
+
+	/* create the demote signal file */
+	demote_file = AllocateFile(demote_filename, "w");
+	if (!demote_file)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						demote_filename)));
+
+	if (FreeFile(demote_file))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write file \"%s\": %m",
+						demote_filename)));
+
+	/* signal the postmaster */
+	if (kill(PostmasterPid, SIGUSR1) != 0)
+	{
+		ereport(WARNING,
+				(errmsg("failed to send signal to postmaster: %m")));
+		(void) unlink(demote_filename);
+		PG_RETURN_BOOL(false);
+	}
+
+	/* return immediately if waiting was not requested */
+	if (!wait)
+		PG_RETURN_BOOL(true);
+
+	/* wait for the amount of time wanted until demotion */
+#define WAITS_PER_SECOND 10
+	for (i = 0; i < WAITS_PER_SECOND * wait_seconds; i++)
+	{
+		int			rc;
+
+		ResetLatch(MyLatch);
+
+		if (RecoveryInProgress())
+			PG_RETURN_BOOL(true);
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   1000L / WAITS_PER_SECOND,
+					   WAIT_EVENT_DEMOTE);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			PG_RETURN_BOOL(false);
+	}
+
+	ereport(WARNING,
+			(errmsg("server did not demote within %d seconds", wait_seconds)));
+	PG_RETURN_BOOL(false);
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8625cbeab6..573d7b46eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1219,6 +1219,11 @@ CREATE OR REPLACE FUNCTION
   RETURNS boolean STRICT VOLATILE LANGUAGE INTERNAL AS 'pg_promote'
   PARALLEL SAFE;
 
+CREATE OR REPLACE FUNCTION
+  pg_demote(fast boolean DEFAULT true, wait boolean DEFAULT true, wait_seconds integer DEFAULT 60)
+  RETURNS boolean STRICT VOLATILE LANGUAGE INTERNAL AS 'pg_demote'
+  PARALLEL SAFE;
+
 -- legacy definition for compatibility with 9.3
 CREATE OR REPLACE FUNCTION
   json_populate_record(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
@@ -1435,6 +1440,7 @@ REVOKE EXECUTE ON FUNCTION pg_reload_conf() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_current_logfile() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_current_logfile(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_demote(boolean, boolean, integer) FROM public;
 
 REVOKE EXECUTE ON FUNCTION pg_stat_reset() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_stat_reset_shared(text) FROM public;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 082a11f270..9e4d000d00 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6084,6 +6084,10 @@
   proname => 'pg_promote', provolatile => 'v', prorettype => 'bool',
   proargtypes => 'bool int4', proargnames => '{wait,wait_seconds}',
   prosrc => 'pg_promote' },
+{ oid => '8967', descr => 'demote production server',
+  proname => 'pg_demote', provolatile => 'v', prorettype => 'bool',
+  proargtypes => 'bool bool int4', proargnames => '{fast,wait,wait_seconds}',
+  prosrc => 'pg_demote' },
 { oid => '2848', descr => 'switch to new wal file',
   proname => 'pg_switch_wal', provolatile => 'v', prorettype => 'pg_lsn',
   proargtypes => '', prosrc => 'pg_switch_wal' },
-- 
2.20.1
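
A note on how pg_demote() above chooses its signal file: it starts from the literal "demote_fast" and, for a smart demote, truncates it at index 6 to get "demote". A standalone sketch of that trick follows; demote_signal_name() is a hypothetical helper written for this illustration and is not part of the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: copy the signal file name for
 * the requested mode into buf. Returns false if buf is too small.
 */
static bool
demote_signal_name(bool fast, char *buf, size_t buflen)
{
	char		name[] = "demote_fast";

	/* "demote" is the first 6 characters of "demote_fast" */
	if (!fast)
		name[6] = '\0';

	if (strlen(name) + 1 > buflen)
		return false;

	/* copy the name including its terminating NUL */
	memcpy(buf, name, strlen(name) + 1);
	return true;
}
```

The truncation only works because "demote" happens to be a prefix of "demote_fast". Choosing between DEMOTE_SIGNAL_FILE and DEMOTE_FAST_SIGNAL_FILE, which the series defines in xlog.h, would probably avoid depending on that coincidence, assuming the header is visible from xlogfuncs.c.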

v4-0004-demote-add-various-tests-related-to-demote-and-promo.patch (text/x-patch)
From 4d0ad53e42f3385bab21588b7729008bcb10b6af Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Fri, 10 Jul 2020 02:00:38 +0200
Subject: [PATCH 4/4] demote: add various tests related to demote and promote
 actions

* demote/promote with a standby replicating from the node
* make sure 2PC transactions survive a demote/promote cycle
* commit 2PC and check the result
* swap roles between primary and standby
* make sure wal sender enters cascade mode
* commit a 2PC on the new primary
* confirm behavior of backends during smart/fast demote
---
 src/test/perl/PostgresNode.pm             |  25 ++
 src/test/recovery/t/021_promote-demote.pl | 287 ++++++++++++++++++++++
 2 files changed, 312 insertions(+)
 create mode 100644 src/test/recovery/t/021_promote-demote.pl

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 8c1b77376f..4488365ffc 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -906,6 +906,31 @@ sub promote
 
 =pod
 
+=item $node->demote()
+
+Wrapper for pg_ctl demote
+
+=cut
+
+sub demote
+{
+	my ($self, $mode) = @_;
+	my $port    = $self->port;
+	my $pgdata  = $self->data_dir;
+	my $logfile = $self->logfile;
+	my $name    = $self->name;
+
+	$mode = 'fast' unless defined $mode;
+
+	print "### Demoting node \"$name\" using mode $mode\n";
+
+	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-l', $logfile,
+		'-m', $mode, 'demote');
+	return;
+}
+
+=pod
+
 =item $node->logrotate()
 
 Wrapper for pg_ctl logrotate
diff --git a/src/test/recovery/t/021_promote-demote.pl b/src/test/recovery/t/021_promote-demote.pl
new file mode 100644
index 0000000000..245acfb211
--- /dev/null
+++ b/src/test/recovery/t/021_promote-demote.pl
@@ -0,0 +1,287 @@
+# Test demote/promote actions in various scenarios using three
+# nodes alpha, beta and gamma. We check action results, data
+# replication and cascading across multiple demote/promote
+# cycles, manual switchover, and smart and fast demote.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize node alpha
+my $node_alpha = get_new_node('alpha');
+$node_alpha->init(allows_streaming => 1);
+$node_alpha->append_conf(
+	'postgresql.conf', qq(
+	max_prepared_transactions = 10
+));
+
+# Take backup
+my $backup_name = 'alpha_backup';
+$node_alpha->start;
+$node_alpha->backup($backup_name);
+
+# Create node beta from backup
+my $node_beta = get_new_node('beta');
+$node_beta->init_from_backup($node_alpha, $backup_name);
+$node_beta->enable_streaming($node_alpha);
+$node_beta->start;
+
+# Create node gamma from backup
+my $node_gamma = get_new_node('gamma');
+$node_gamma->init_from_backup($node_alpha, $backup_name);
+$node_gamma->enable_streaming($node_alpha);
+$node_gamma->start;
+
+# Create some 2PC on alpha for future tests
+$node_alpha->safe_psql('postgres', q{
+CREATE TABLE ins AS SELECT 1 AS i;
+BEGIN;
+CREATE TABLE new AS SELECT generate_series(1,5) AS i;
+PREPARE TRANSACTION 'pxact1';
+BEGIN;
+INSERT INTO ins VALUES (2);
+PREPARE TRANSACTION 'pxact2';
+});
+
+# create an idle in xact session
+my ($sess1_in, $sess1_out, $sess1_err) = ('', '', '');
+my $sess1 = IPC::Run::start(
+	[
+		'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+		$node_alpha->connstr('postgres')
+	],
+	'<', \$sess1_in,
+	'>', \$sess1_out,
+	'2>', \$sess1_err);
+
+$sess1_in = q{
+BEGIN;
+CREATE TABLE public.test_aborted (i int);
+SELECT pg_backend_pid();
+};
+$sess1->pump until $sess1_out =~ qr/[[:digit:]]+[\r\n]$/m;
+my $sess1_pid = $sess1_out;
+chomp $sess1_pid;
+
+# create an idle session
+my ($sess2_in, $sess2_out, $sess2_err) = ('', '', '');
+my $sess2 = IPC::Run::start(
+	[
+		'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+		$node_alpha->connstr('postgres')
+	],
+	'<', \$sess2_in,
+	'>', \$sess2_out,
+	'2>', \$sess2_err);
+$sess2_in = q{
+SELECT pg_backend_pid();
+};
+$sess2->pump until $sess2_out =~ qr/\d+\s*$/m;
+my $sess2_pid = $sess2_out;
+chomp $sess2_pid;
+
+$sess2_in = q{
+SELECT pg_is_in_recovery();
+};
+$sess2->pump until $sess2_out =~ qr/(t|f)\s*$/m;
+
+# idle session is not in recovery
+is( $1, 'f', 'idle session is not in recovery' );
+
+# Fast demote alpha.
+# Secondaries beta and gamma should keep streaming from it as cascaded standbys.
+# Idle in xact session should be terminated, idle session should stay alive.
+$node_alpha->demote('fast');
+
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	't', 'node alpha demoted to standby' );
+
+is( $node_alpha->safe_psql(
+		'postgres',
+		'SELECT array_agg(application_name ORDER BY application_name ASC) FROM pg_stat_replication'),
+	'{beta,gamma}', 'standbys keep replicating with alpha after demote' );
+
+# the idle in xact session should not survive the demote
+is( $node_alpha->safe_psql(
+		'postgres',
+		qq{SELECT count(*)
+		   FROM pg_catalog.pg_stat_activity
+		   WHERE pid = $sess1_pid}),
+	'0', 'previous idle in transaction session should be terminated' );
+
+# table "test_aborted" has been rolled back
+is( $node_alpha->safe_psql(
+		'postgres',
+		q{SELECT count(*) FROM pg_catalog.pg_class
+		  WHERE relname='test_aborted'
+		    AND relnamespace = (SELECT oid FROM pg_namespace
+		                        WHERE nspname='public')}),
+	'0', 'the transaction has been aborted during fast demote' );
+
+# the idle session should survive the demote
+is( $node_alpha->safe_psql(
+		'postgres',
+		qq{SELECT count(*)
+		   FROM pg_catalog.pg_stat_activity
+		   WHERE pid = $sess2_pid}),
+	'1', "the idle session should survive the demote: $sess2_pid" );
+
+# the idle session should report in recovery
+$sess2_out = '';
+$sess2_in = q{
+SELECT pg_is_in_recovery();
+};
+$sess2->pump until $sess2_out =~ qr/(t|f)\s*$/m;
+
+# idle session is now in recovery
+is( $1, 't', 'the idle session reports in recovery' );
+
+# close both sessions
+$sess1_out = $sess2_out = $sess1_in = $sess2_in = '';
+$sess1->finish;
+$sess2->finish;
+
+# Promote alpha back in production.
+$node_alpha->promote;
+
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	'f', "node alpha promoted" );
+
+# Check all 2PC xact have been restored
+is( $node_alpha->safe_psql(
+		'postgres',
+		"SELECT string_agg(gid, ',' order by gid asc) FROM pg_prepared_xacts"),
+	'pxact1,pxact2', "prepared transactions 'pxact1' and 'pxact2' exist" );
+
+# Commit one 2PC and check it on alpha and beta
+$node_alpha->safe_psql( 'postgres', "commit prepared 'pxact1'");
+
+is( $node_alpha->safe_psql(
+		'postgres', "SELECT array_agg(i::text ORDER BY i ASC) FROM new"),
+	'{1,2,3,4,5}', "prepared transaction 'pxact1' committed" );
+
+$node_alpha->wait_for_catchup($node_beta);
+$node_alpha->wait_for_catchup($node_gamma);
+
+is( $node_beta->safe_psql(
+		'postgres', "SELECT array_agg(i::text ORDER BY i ASC) FROM new"),
+	'{1,2,3,4,5}', "prepared transaction 'pxact1' replicated to beta" );
+
+is( $node_gamma->safe_psql(
+		'postgres', "SELECT array_agg(i::text ORDER BY i ASC) FROM new"),
+	'{1,2,3,4,5}', "prepared transaction 'pxact1' replicated to gamma" );
+
+# create another idle in xact session
+$sess1_in = q{
+BEGIN;
+CREATE TABLE public.test_succeed (i int);
+SELECT pg_backend_pid();
+};
+$sess1->pump until $sess1_out =~ qr/\d+\s*$/m;
+$sess1_pid = $sess1_out;
+chomp $sess1_pid;
+
+# swap roles between alpha and beta
+
+# Demote alpha in smart mode.
+# Don't wait for demote to complete here so we can use sess1
+# to keep doing some more write activity before commit and demote.
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_demote(false, false)'),
+	't', "demote signal sent to node alpha" );
+
+# wait for the demote to begin and wait for active xact.
+my $fh;
+while (1) {
+	my $status;
+	open my $fh, '<', $node_alpha->data_dir . '/postmaster.pid';
+	$status = $_ while <$fh>;
+	close $fh;
+	chomp($status);
+	last if $status eq 'demoting';
+	sleep 1;
+}
+
+# make sure the demote waits for running xacts
+sleep 2;
+
+# test no new session possible during demote
+$sess2_in = q{
+SELECT 1;
+};
+$sess2->start;
+$sess2->finish;
+ok( $sess2_err =~ /FATAL:  the database system is demoting\s$/, 'session rejected during demote process');
+
+# add some write activity on demote-blocking session sess1
+$sess1_out = '';
+$sess1_in = q{
+INSERT INTO public.test_succeed VALUES (1) RETURNING i;
+COMMIT;
+};
+$sess1->pump until $sess1_out =~ qr/\d+\s*$/m;
+$sess1->finish;
+
+chomp($sess1_out);
+is($sess1_out, '1', 'session in active xact able to write after the smart demote signal');
+
+$node_alpha->poll_query_until('postgres', 'SELECT pg_is_in_recovery()', 't');
+
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	't', "node alpha demoted" );
+
+# fetch the last REDO location from alpha and check beta received everything
+my ($stdout, $stderr) = run_command([ 'pg_controldata', $node_alpha->data_dir ]);
+$stdout =~ m{REDO location:\s+([0-9A-F]+/[0-9A-F]+)$}mg;
+my $redo_loc = $1;
+
+is( $node_beta->safe_psql(
+		'postgres',
+		"SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), '$redo_loc') > 0 "),
+	't', "node beta received the demote checkpoint from alpha" );
+
+# promote beta and check it
+$node_beta->promote;
+is( $node_beta->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	'f', "node beta promoted" );
+
+# Setup alpha to replicate from beta
+$node_alpha->enable_streaming($node_beta);
+$node_alpha->reload;
+
+# check alpha is replicating from it
+$node_beta->wait_for_catchup($node_alpha);
+
+is( $node_beta->safe_psql(
+		'postgres', 'SELECT application_name FROM pg_stat_replication'),
+	$node_alpha->name, 'alpha is replicating from beta' );
+
+# check gamma is still replicating from alpha
+$node_alpha->wait_for_catchup($node_gamma, 'write', $node_alpha->lsn('receive'));
+
+is( $node_alpha->safe_psql(
+		'postgres', 'SELECT application_name FROM pg_stat_replication'),
+	$node_gamma->name, 'gamma is replicating from alpha' );
+
+# make sure the second 2PC is still available on beta
+is( $node_beta->safe_psql(
+		'postgres', 'SELECT gid FROM pg_prepared_xacts'),
+	'pxact2', "prepared transaction 'pxact2' exists" );
+
+# commit the second 2PC and check its result on alpha and beta nodes
+$node_beta->safe_psql( 'postgres', "commit prepared 'pxact2'");
+
+is( $node_beta->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', "prepared transaction 'pxact2' committed" );
+
+$node_beta->wait_for_catchup($node_alpha);
+is( $node_alpha->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', "prepared transaction 'pxact2' streamed to alpha" );
+
+# check the 2PC has been cascaded to gamma
+$node_alpha->wait_for_catchup($node_gamma, 'write', $node_alpha->lsn('receive'));
+is( $node_gamma->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', "prepared transaction 'pxact2' streamed to gamma" );
-- 
2.20.1

In reply to: Jehan-Guillaume de Rorthais (#28)
4 attachment(s)
Re: [patch] demote

Hi,

Please find attached v5 of the patch set, rebased on master after various
conflicts.

Regards,

On Wed, 5 Aug 2020 00:04:53 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

Demote now keeps backends with no active xid alive. Smart mode keeps all
backends: it waits for them to finish their xact and enter read-only. Fast
mode terminates backends with an active xid and keeps all the other ones.
Backends enter "read-only" by setting LocalXLogInsertAllowed=0, then flip it
to -1 (check recovery state) once demoted.
During demote, no new session is allowed.
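
To illustrate the tri-state flag described above, here is a minimal,
standalone sketch. It is not code from the patch: every model_* name is
hypothetical, and the shared recovery state is reduced to a single boolean.
It only shows how a backend that sets the flag to 0 refuses WAL inserts, and
how flipping it back to -1 defers the decision to the recovery state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared recovery state kept in XLogCtl. */
static bool shared_recovery_in_progress = false;

/*
 * Tri-state flag modeled after LocalXLogInsertAllowed:
 *   1 = WAL insert allowed, 0 = forbidden, -1 = check the recovery state.
 */
static int	local_xlog_insert_allowed = -1;

/* Simplified model of the "may this process write WAL?" decision. */
bool
model_xlog_insert_allowed(void)
{
	assert(local_xlog_insert_allowed >= -1 && local_xlog_insert_allowed <= 1);

	if (local_xlog_insert_allowed >= 0)
		return local_xlog_insert_allowed != 0;

	/* -1: no local override, writes are allowed only outside recovery */
	return !shared_recovery_in_progress;
}

/* A surviving backend enters read-only while the demote is in progress. */
void
model_backend_enter_read_only(void)
{
	local_xlog_insert_allowed = 0;
}

/* Once demoted, the backend goes back to checking the recovery state. */
void
model_backend_demoted(void)
{
	local_xlog_insert_allowed = -1;
}

/* Assumption: in the real server the startup process owns this state. */
void
model_set_recovery(bool in_recovery)
{
	shared_recovery_in_progress = in_recovery;
}
```

With this model, a backend that went through the read-only step and then
the demoted step keeps refusing writes for as long as the instance is in
recovery, and accepts them again after a later promote without any further
local change.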

As backends with no active xid survive, a new SQL admin function
"pg_demote(fast bool, wait bool, wait_seconds int)" has been added.

Demote now relies on SIGUSR1 instead of hijacking SIGTERM/SIGINT and pmdie().
The resulting refactoring makes the code much simpler and cleaner, with better
isolation of the demote actions.

Thanks to the refactoring, the patch now only adds one state to the state
machine: PM_DEMOTING. A second one could be used to replace:

/* Demoting: start the Startup Process */
if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0)

with eg.:

if (pmState == PM_DEMOTED)

I believe it might be a bit simpler to understand, but the existing comment
might be good enough as well. The full state machine path for demote is:

PM_DEMOTING                      /* wait for active xid backends to finish */
PM_SHUTDOWN                      /* wait for the shutdown checkpoint and its
                                    various shutdown tasks */
PM_SHUTDOWN && !CheckpointerPID  /* aka PM_DEMOTED: start Startup process */
PM_STARTUP
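
The path above can be sketched as a tiny standalone state machine. This is
only an illustration under stated assumptions, not the postmaster code: the
Model* types and model_pm_advance() are hypothetical names, and the
postmaster is reduced to the three inputs mentioned in this mail (the demote
signal, the number of backends still holding an xid, and the checkpointer
PID).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the postmaster states involved in a demote. */
typedef enum
{
	MODEL_PM_RUN,
	MODEL_PM_DEMOTING,
	MODEL_PM_SHUTDOWN,
	MODEL_PM_STARTUP
} ModelPMState;

typedef struct
{
	ModelPMState state;
	bool		demote_signal;		/* a demote has been requested */
	int			xid_backends;		/* backends still holding an xid */
	int			checkpointer_pid;	/* 0 once the checkpointer has exited */
} ModelPostmaster;

/* Advance the model by at most one transition and return the new state. */
ModelPMState
model_pm_advance(ModelPostmaster *pm)
{
	assert(pm != NULL);

	switch (pm->state)
	{
		case MODEL_PM_RUN:
			if (pm->demote_signal)
				pm->state = MODEL_PM_DEMOTING;
			break;

		case MODEL_PM_DEMOTING:
			/* wait for backends with an active xid to finish */
			if (pm->xid_backends == 0)
				pm->state = MODEL_PM_SHUTDOWN;
			break;

		case MODEL_PM_SHUTDOWN:
			/* "PM_DEMOTED": checkpointer is gone, start the startup process */
			if (pm->demote_signal && pm->checkpointer_pid == 0)
			{
				pm->demote_signal = false;
				pm->state = MODEL_PM_STARTUP;
			}
			break;

		case MODEL_PM_STARTUP:
			break;
	}

	return pm->state;
}
```

The MODEL_PM_SHUTDOWN branch uses the same three-part condition quoted
earlier; a dedicated PM_DEMOTED state would turn it into a plain state test.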

Tests in "recovery/t/021_promote-demote.pl" grow from 13 to 24 tests,
adding tests on backend behavior during demote and on the new function
pg_demote().

On my todo:

* cancel running checkpoint for fast demote ?
* forbid demote when PITR backup is in progress
* user documentation
* Robert's concern about snapshot during hot standby
* anything else reported to me

Plus, I might be able to split the backend part of patch 0002, along with
its signals, into its own patch if it helps the review. It would apply after
0001 and before the actual 0002.

As there was no consensus, and the discussions seemed to conclude that this
patch set should keep growing to see where it goes, I wonder if/when I should
add it to the commitfest. Advice? Opinions?

Attachments:

v5-0001-demote-setter-functions-for-LocalXLogInsert-local-va.patch (text/x-patch)
From 90e26a7e2d53f7a8436de0b73eb57498f884de9d Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Fri, 31 Jul 2020 10:58:40 +0200
Subject: [PATCH 1/4] demote: setter functions for LocalXLogInsert local
 variable

Adds extern functions LocalSetXLogInsertNotAllowed() and
LocalSetXLogInsertCheckRecovery() to set the local variable
LocalXLogInsertAllowed to 0 and -1 respectively.

These functions are declared as extern for future need in
the demote patch.

Function LocalSetXLogInsertAllowed() already exists and is
declared as static as it is not needed outside of xlog.c.
---
 src/backend/access/transam/xlog.c | 27 +++++++++++++++++++++++----
 src/include/access/xlog.h         |  2 ++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae..c0d79f192c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7711,7 +7711,7 @@ StartupXLOG(void)
 	Insert->fullPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
+	LocalSetXLogInsertCheckRecovery();
 
 	if (InRecovery)
 	{
@@ -8219,6 +8219,25 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+/*
+ * Make XLogInsertAllowed() return false in the current process only.
+ */
+void
+LocalSetXLogInsertNotAllowed(void)
+{
+	LocalXLogInsertAllowed = 0;
+}
+
+/*
+ * Make XLogInsertAllowed() go back to checking the recovery state.
+ */
+void
+LocalSetXLogInsertCheckRecovery(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -9004,9 +9023,9 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		if (flags & CHECKPOINT_END_OF_RECOVERY)
-			LocalXLogInsertAllowed = -1;	/* return to "check" state */
+			LocalSetXLogInsertCheckRecovery(); /* return to "check" state */
 		else
-			LocalXLogInsertAllowed = 0; /* never again write WAL */
+			LocalSetXLogInsertNotAllowed(); /* never again write WAL */
 	}
 
 	/*
@@ -9159,7 +9178,7 @@ CreateEndOfRecoveryRecord(void)
 
 	END_CRIT_SECTION();
 
-	LocalXLogInsertAllowed = -1;	/* return to "check" state */
+	LocalSetXLogInsertCheckRecovery();	/* return to "check" state */
 }
 
 /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..8c9cadc6da 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,8 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void LocalSetXLogInsertNotAllowed(void);
+extern void LocalSetXLogInsertCheckRecovery(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
-- 
2.20.1

v5-0002-demote-support-demoting-instance-from-production-to-.patch (text/x-patch)
From f655909a19250eea98a668051f5941bdd87ad5b2 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Fri, 10 Apr 2020 18:01:45 +0200
Subject: [PATCH 2/4] demote: support demoting instance from production to
 standby

This patch adds the ability for an in-production read-write
instance to step back to a read-only standby. Two different
modes are supported: fast or smart. Fast demote mode cancels
transactions holding an xid to demote as fast as possible.
Smart demote mode waits for existing transactions holding an
xid to finish and forbids new ones.

The demote process is triggered by creating a "demote" or
"demote_fast" trigger file then signaling postmaster with
SIGUSR1. This patch adds "pg_ctl [-m {fast|smart}] demote"
and an SQL admin function will be added in a separate commit.

The demote procedure starts by setting the postmaster state
machine to PM_DEMOTING, which ends when all backends are
read-only and have set LocalXLogInsertAllowed=0 to forbid
new writes.

When PM_DEMOTING finishes, the state is set to PM_SHUTDOWN
to create a shutdown checkpoint and exit the checkpointer.
We could have kept the checkpointer around with some more
effort, but there's no good reason worth the additional
code complexity.

ShutdownXLOG now takes a boolean arg to handle demote
differently than a normal shutdown. This allows the
checkpointer to leave walsenders alive to demote faster.
Moreover, this might be useful in the future for eg.
implementing a controlled switchover over the replication
protocol.

The checkpointer sets the cluster state to DB_DEMOTING so
the Startup process can detect the demote procedure in
StartupXLOG() and handle subsystems accordingly.

The startup process is started on checkpointer exit. The
postmaster state machine is then switched to PM_STARTUP.

Some sub-processes are kept alive during the demote procedure:
the stat collector, bgwriter and optionally archiver and wal
senders.

At the end of demote, USR1 is sent to all backends and wal
senders to set their environment as in recovery and cascading.

Discuss/Todo:

* add doc
* do not handle backup in progress during demote
* investigate snapshots shmem needs/init during recovery compared to
  production
* cancel running checkpoint during demote
  * replace with a END_OF_PRODUCTION xlog record?
---
 src/backend/access/transam/twophase.c   |  98 ++++++++
 src/backend/access/transam/xlog.c       | 320 ++++++++++++++++--------
 src/backend/postmaster/checkpointer.c   |  28 +++
 src/backend/postmaster/pgstat.c         |   3 +
 src/backend/postmaster/postmaster.c     | 235 ++++++++++++++++-
 src/backend/storage/ipc/procarray.c     |   2 +
 src/backend/storage/ipc/procsignal.c    |  30 +++
 src/backend/storage/lmgr/lock.c         |  12 +
 src/backend/tcop/postgres.c             |  44 ++++
 src/backend/utils/init/globals.c        |   1 +
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/bin/pg_ctl/pg_ctl.c                 | 117 +++++++++
 src/include/access/twophase.h           |   1 +
 src/include/access/xlog.h               |  23 +-
 src/include/catalog/pg_control.h        |   1 +
 src/include/libpq/libpq-be.h            |   2 +-
 src/include/miscadmin.h                 |   1 +
 src/include/pgstat.h                    |   1 +
 src/include/postmaster/bgwriter.h       |   1 +
 src/include/storage/lock.h              |   2 +
 src/include/storage/procsignal.h        |   4 +
 src/include/tcop/tcopprot.h             |   2 +
 src/include/utils/pidfile.h             |   1 +
 23 files changed, 802 insertions(+), 129 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ef4f9981e3..31deb487ed 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1557,6 +1557,104 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	pfree(buf);
 }
 
+/*
+ * ShutdownPreparedTransactions: clean prepared xacts from shared memory
+ *
+ * This is called during the demote process to clean the shared memory
+ * before the startup process load everything back in correctly
+ * for the standby mode.
+ *
+ * Note: this function assumes all prepared transactions have been
+ * written to disk. In consequence, it must be called AFTER the demote
+ * shutdown checkpoint.
+ *
+ * FIXME demote: pay attention to the previous note when removing shutdown
+ * checkpoint from the demote procedure.
+ */
+void
+ShutdownPreparedTransactions(void)
+{
+	int i;
+
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact;
+		PGPROC	   *proc;
+		TransactionId xid;
+		char	   *buf;
+		char	   *bufptr;
+		TwoPhaseFileHeader *hdr;
+		TransactionId latestXid;
+		TransactionId *children;
+
+		gxact = TwoPhaseState->prepXacts[i];
+		proc = &ProcGlobal->allProcs[gxact->pgprocno];
+		xid = gxact->xid;
+
+		/* Read and validate 2PC state data */
+		Assert(gxact->ondisk);
+		buf = ReadTwoPhaseFile(xid, false);
+
+		/*
+		 * Disassemble the header area
+		 */
+		hdr = (TwoPhaseFileHeader *) buf;
+		Assert(TransactionIdEquals(hdr->xid, xid));
+		bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader))
+			   + MAXALIGN(hdr->gidlen);
+		children = (TransactionId *) bufptr;
+		bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId))
+				+ MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode))
+				+ MAXALIGN(hdr->nabortrels * sizeof(RelFileNode))
+				+ MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+
+		/* compute latestXid among all children */
+		latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children);
+
+		/* remove dummy proc associated to the gxact */
+		ProcArrayRemove(proc, latestXid);
+
+		/*
+		 * This lock is probably not needed during the demote process
+		 * as all backends are already gone.
+		 */
+		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+		/* cleanup locks */
+		for (;;)
+		{
+			TwoPhaseRecordOnDisk *record = (TwoPhaseRecordOnDisk *) bufptr;
+
+			Assert(record->rmid <= TWOPHASE_RM_MAX_ID);
+			if (record->rmid == TWOPHASE_RM_END_ID)
+				break;
+
+			bufptr += MAXALIGN(sizeof(TwoPhaseRecordOnDisk));
+
+			if (record->rmid == TWOPHASE_RM_LOCK_ID)
+				lock_twophase_shutdown(xid, record->info,
+									 (void *) bufptr, record->len);
+
+			bufptr += MAXALIGN(record->len);
+		}
+
+		/* and put it back in the freelist */
+		gxact->next = TwoPhaseState->freeGXacts;
+		TwoPhaseState->freeGXacts = gxact;
+
+		/*
+		 * Release the lock as all callbacks are called and shared memory cleanup
+		 * is done.
+		 */
+		LWLockRelease(TwoPhaseStateLock);
+
+		pfree(buf);
+	}
+
+	TwoPhaseState->numPrepXacts -= i;
+	Assert(TwoPhaseState->numPrepXacts == 0);
+}
+
 /*
  * Scan 2PC state data in memory and call the indicated callbacks for each 2PC record.
  */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c0d79f192c..abc88ec241 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6298,6 +6298,11 @@ CheckRequiredParameterValues(void)
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
+/*
+ * FIXME demote: part of the code here assume there's no other active
+ * processes before signal PMSIGNAL_RECOVERY_STARTED is sent.
+ */
+
 void
 StartupXLOG(void)
 {
@@ -6321,6 +6326,7 @@ StartupXLOG(void)
 	XLogPageReadPrivate private;
 	bool		promoted = false;
 	struct stat st;
+	bool		is_demoting = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6385,6 +6391,25 @@ StartupXLOG(void)
 							str_time(ControlFile->time))));
 			break;
 
+		case DB_DEMOTING:
+			ereport(LOG,
+					(errmsg("database system was demoted at %s",
+							str_time(ControlFile->time))));
+			is_demoting = true;
+			bgwriterLaunched = true;
+			InArchiveRecovery = true;
+			StandbyMode = true;
+
+			/*
+			 * previous state was RECOVERY_STATE_DONE. We need to
+			 * reinit it to something else so RecoveryInProgress()
+			 * doesn't return false.
+			 */
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+			break;
+
 		default:
 			ereport(FATAL,
 					(errmsg("control file contains invalid database cluster state")));
@@ -6418,7 +6443,8 @@ StartupXLOG(void)
 	 *   persisted.  To avoid that, fsync the entire data directory.
 	 */
 	if (ControlFile->state != DB_SHUTDOWNED &&
-		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
+		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY &&
+		ControlFile->state != DB_DEMOTING)
 	{
 		RemoveTempXlogFiles();
 		SyncDataDirectory();
@@ -6674,7 +6700,8 @@ StartupXLOG(void)
 					(errmsg("could not locate a valid checkpoint record")));
 		}
 		memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
-		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
+		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN) &&
+			!is_demoting;
 	}
 
 	/*
@@ -6736,9 +6763,9 @@ StartupXLOG(void)
 	LastRec = RecPtr = checkPointLoc;
 
 	ereport(DEBUG1,
-			(errmsg_internal("redo record is at %X/%X; shutdown %s",
+			(errmsg_internal("redo record is at %X/%X; %s checkpoint",
 							 (uint32) (checkPoint.redo >> 32), (uint32) checkPoint.redo,
-							 wasShutdown ? "true" : "false")));
+							 wasShutdown ? "shutdown" : is_demoting? "demote": "")));
 	ereport(DEBUG1,
 			(errmsg_internal("next transaction ID: " UINT64_FORMAT "; next OID: %u",
 							 U64FromFullTransactionId(checkPoint.nextXid),
@@ -6772,47 +6799,7 @@ StartupXLOG(void)
 					 checkPoint.newestCommitTsXid);
 	XLogCtl->ckptFullXid = checkPoint.nextXid;
 
-	/*
-	 * Initialize replication slots, before there's a chance to remove
-	 * required resources.
-	 */
-	StartupReplicationSlots();
-
-	/*
-	 * Startup logical state, needs to be setup now so we have proper data
-	 * during crash recovery.
-	 */
-	StartupReorderBuffer();
 
-	/*
-	 * Startup MultiXact. We need to do this early to be able to replay
-	 * truncations.
-	 */
-	StartupMultiXact();
-
-	/*
-	 * Ditto for commit timestamps.  Activate the facility if the setting is
-	 * enabled in the control file, as there should be no tracking of commit
-	 * timestamps done when the setting was disabled.  This facility can be
-	 * started or stopped when replaying a XLOG_PARAMETER_CHANGE record.
-	 */
-	if (ControlFile->track_commit_timestamp)
-		StartupCommitTs();
-
-	/*
-	 * Recover knowledge about replay progress of known replication partners.
-	 */
-	StartupReplicationOrigin();
-
-	/*
-	 * Initialize unlogged LSN. On a clean shutdown, it's restored from the
-	 * control file. On recovery, all unlogged relations are blown away, so
-	 * the unlogged LSN counter can be reset too.
-	 */
-	if (ControlFile->state == DB_SHUTDOWNED)
-		XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
-	else
-		XLogCtl->unloggedLSN = FirstNormalUnloggedLSN;
 
 	/*
 	 * We must replay WAL entries using the same TimeLineID they were created
@@ -6821,19 +6808,64 @@ StartupXLOG(void)
 	 */
 	ThisTimeLineID = checkPoint.ThisTimeLineID;
 
-	/*
-	 * Copy any missing timeline history files between 'now' and the recovery
-	 * target timeline from archive to pg_wal. While we don't need those files
-	 * ourselves - the history file of the recovery target timeline covers all
-	 * the previous timelines in the history too - a cascading standby server
-	 * might be interested in them. Or, if you archive the WAL from this
-	 * server to a different archive than the primary, it'd be good for all the
-	 * history files to get archived there after failover, so that you can use
-	 * one of the old timelines as a PITR target. Timeline history files are
-	 * small, so it's better to copy them unnecessarily than not copy them and
-	 * regret later.
-	 */
-	restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI);
+	if (!is_demoting)
+	{
+		/*
+		 * Initialize replication slots, before there's a chance to remove
+		 * required resources.
+		 */
+		StartupReplicationSlots();
+
+		/*
+		 * Startup logical state, needs to be setup now so we have proper data
+		 * during crash recovery.
+		 */
+		StartupReorderBuffer();
+
+		/*
+		 * Startup MultiXact. We need to do this early to be able to replay
+		 * truncations.
+		 */
+		StartupMultiXact();
+
+		/*
+		 * Ditto for commit timestamps.  Activate the facility if the setting is
+		 * enabled in the control file, as there should be no tracking of commit
+		 * timestamps done when the setting was disabled.  This facility can be
+		 * started or stopped when replaying a XLOG_PARAMETER_CHANGE record.
+		 */
+		if (ControlFile->track_commit_timestamp)
+			StartupCommitTs();
+
+		/*
+		 * Recover knowledge about replay progress of known replication partners.
+		 */
+		StartupReplicationOrigin();
+
+		/*
+		 * Initialize unlogged LSN. On a clean shutdown, it's restored from the
+		 * control file. On recovery, all unlogged relations are blown away, so
+		 * the unlogged LSN counter can be reset too.
+		 */
+		if (ControlFile->state == DB_SHUTDOWNED)
+			XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
+		else
+			XLogCtl->unloggedLSN = FirstNormalUnloggedLSN;
+
+		/*
+		 * Copy any missing timeline history files between 'now' and the recovery
+		 * target timeline from archive to pg_wal. While we don't need those files
+		 * ourselves - the history file of the recovery target timeline covers all
+		 * the previous timelines in the history too - a cascading standby server
+		 * might be interested in them. Or, if you archive the WAL from this
+		 * server to a different archive than the master, it'd be good for all the
+		 * history files to get archived there after failover, so that you can use
+		 * one of the old timelines as a PITR target. Timeline history files are
+		 * small, so it's better to copy them unnecessarily than not copy them and
+		 * regret later.
+		 */
+		restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI);
+	}
 
 	/*
 	 * Before running in recovery, scan pg_twophase and fill in its status to
@@ -6888,11 +6920,25 @@ StartupXLOG(void)
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
 		{
-			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+			if (is_demoting)
+			{
+				/*
+				 * Avoid concurrent access to the ControlFile data
+				 * during demotion.
+				 */
+				LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+				ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+				LWLockRelease(ControlFileLock);
+			}
+			else
+			{
+				ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
 
-			SpinLockAcquire(&XLogCtl->info_lck);
-			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
-			SpinLockRelease(&XLogCtl->info_lck);
+				/* This is already set if demoting */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+			}
 		}
 		else
 		{
@@ -6982,7 +7028,8 @@ StartupXLOG(void)
 		/*
 		 * Reset pgstat data, because it may be invalid after recovery.
 		 */
-		pgstat_reset_all();
+		if (!is_demoting)
+			pgstat_reset_all();
 
 		/*
 		 * If there was a backup label file, it's done its job and the info
@@ -7044,7 +7091,7 @@ StartupXLOG(void)
 
 			InitRecoveryTransactionEnvironment();
 
-			if (wasShutdown)
+			if (wasShutdown || is_demoting)
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
@@ -7057,6 +7104,11 @@ StartupXLOG(void)
 			 * Startup commit log and subtrans only.  MultiXact and commit
 			 * timestamp have already been started up and other SLRUs are not
 			 * maintained during recovery and need not be started yet.
+			 *
+			 * Starting up commit log is technically not needed during demote
+			 * as the in-memory data did not move. However, this is a
+			 * lightweight initialization and this might seem expected as
+			 * pure symmetry as ShutdownCLOG() is called during ShutdownXLog().
 			 */
 			StartupCLOG();
 			StartupSUBTRANS(oldestActiveXID);
@@ -7067,7 +7119,7 @@ StartupXLOG(void)
 			 * empty running-xacts record and use that here and now. Recover
 			 * additional standby state for prepared transactions.
 			 */
-			if (wasShutdown)
+			if (wasShutdown || is_demoting)
 			{
 				RunningTransactionsData running;
 				TransactionId latestCompletedXid;
@@ -7938,6 +7990,7 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	XLogCtl->SharedHotStandbyActive = false;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -8056,6 +8109,23 @@ CheckRecoveryConsistency(void)
 	}
 }
 
+/*
+ * Refresh LocalRecoveryInProgress from the shared recovery state
+ */
+bool
+SetLocalRecoveryInProgress(void)
+{
+	/*
+	 * use volatile pointer to make sure we make a fresh read of the
+	 * shared variable.
+	 */
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
+
+	return LocalRecoveryInProgress;
+}
+
 /*
  * Is the system still in recovery?
  *
@@ -8077,13 +8147,7 @@ RecoveryInProgress(void)
 		return false;
 	else
 	{
-		/*
-		 * use volatile pointer to make sure we make a fresh read of the
-		 * shared variable.
-		 */
-		volatile XLogCtlData *xlogctl = XLogCtl;
-
-		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
+		SetLocalRecoveryInProgress();
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -8503,6 +8567,8 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
 void
 ShutdownXLOG(int code, Datum arg)
 {
+	bool is_demoting = DatumGetBool(arg);
+
 	/*
 	 * We should have an aux process resource owner to use, and we should not
 	 * be in a transaction that's installed some other resowner.
@@ -8512,35 +8578,60 @@ ShutdownXLOG(int code, Datum arg)
 		   CurrentResourceOwner == AuxProcessResourceOwner);
 	CurrentResourceOwner = AuxProcessResourceOwner;
 
-	/* Don't be chatty in standalone mode */
-	ereport(IsPostmasterEnvironment ? LOG : NOTICE,
-			(errmsg("shutting down")));
-
-	/*
-	 * Signal walsenders to move to stopping state.
-	 */
-	WalSndInitStopping();
-
-	/*
-	 * Wait for WAL senders to be in stopping state.  This prevents commands
-	 * from writing new WAL.
-	 */
-	WalSndWaitStopping();
+	if (is_demoting)
+	{
+		/*
+		 * In contrast with normal shutdown, we keep wal senders alive
+		 * during demote. First, this allows the demote to complete faster.
+		 * Second, we might need them in the future to implement a controlled
+		 * switchover over the replication protocol.
+		 */
+		/* Don't be chatty in standalone mode */
+		ereport(IsPostmasterEnvironment ? LOG : NOTICE,
+				(errmsg("demoting")));
 
-	if (RecoveryInProgress())
-		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		/*
+		 * FIXME demote: avoiding checkpoint?
+		 * A checkpoint is probably running during a demote action. If
+		 * we don't want to wait for the checkpoint during the demote,
+		 * we might need to cancel it as it will not be able to write
+		 * to the WAL after the demote.
+		 */
+		CreateCheckPoint(CHECKPOINT_IS_DEMOTE | CHECKPOINT_IMMEDIATE);
+		ShutdownPreparedTransactions();
+	}
 	else
 	{
+		/* Don't be chatty in standalone mode */
+		ereport(IsPostmasterEnvironment ? LOG : NOTICE,
+				(errmsg("shutting down")));
+
 		/*
-		 * If archiving is enabled, rotate the last XLOG file so that all the
-		 * remaining records are archived (postmaster wakes up the archiver
-		 * process one more time at the end of shutdown). The checkpoint
-		 * record will go to the next XLOG file and won't be archived (yet).
+		 * Signal walsenders to move to stopping state.
 		 */
-		if (XLogArchivingActive() && XLogArchiveCommandSet())
-			RequestXLogSwitch(false);
+		WalSndInitStopping();
+
+		/*
+		 * Wait for WAL senders to be in stopping state.  This prevents commands
+		 * from writing new WAL.
+		 */
+		WalSndWaitStopping();
+
+		if (RecoveryInProgress())
+			CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		else
+		{
+			/*
+			 * If archiving is enabled, rotate the last XLOG file so that all the
+			 * remaining records are archived (postmaster wakes up the archiver
+			 * process one more time at the end of shutdown). The checkpoint
+			 * record will go to the next XLOG file and won't be archived (yet).
+			 */
+			if (XLogArchivingActive() && XLogArchiveCommandSet())
+				RequestXLogSwitch(false);
 
-		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+			CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		}
 	}
 	ShutdownCLOG();
 	ShutdownCommitTs();
@@ -8554,9 +8645,10 @@ ShutdownXLOG(int code, Datum arg)
 static void
 LogCheckpointStart(int flags, bool restartpoint)
 {
-	elog(LOG, "%s starting:%s%s%s%s%s%s%s%s",
+	elog(LOG, "%s starting:%s%s%s%s%s%s%s%s%s",
 		 restartpoint ? "restartpoint" : "checkpoint",
 		 (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
+		 (flags & CHECKPOINT_IS_DEMOTE) ? " demote" : "",
 		 (flags & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "",
 		 (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
 		 (flags & CHECKPOINT_FORCE) ? " force" : "",
@@ -8692,6 +8784,7 @@ UpdateCheckPointDistanceEstimate(uint64 nbytes)
  *
  * flags is a bitwise OR of the following:
  *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
+ *	CHECKPOINT_IS_DEMOTE: checkpoint is for demote.
  *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
  *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *		ignoring checkpoint_completion_target parameter.
@@ -8720,6 +8813,7 @@ void
 CreateCheckPoint(int flags)
 {
 	bool		shutdown;
+	bool		demote;
 	CheckPoint	checkPoint;
 	XLogRecPtr	recptr;
 	XLogSegNo	_logSegNo;
@@ -8732,14 +8826,21 @@ CreateCheckPoint(int flags)
 	int			nvxids;
 
 	/*
-	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
-	 * issued at a different time.
+	 * An end-of-recovery or demote checkpoint is really a shutdown checkpoint,
+	 * just issued at a different time.
 	 */
-	if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
+	if (flags & (CHECKPOINT_IS_SHUTDOWN |
+				 CHECKPOINT_IS_DEMOTE |
+				 CHECKPOINT_END_OF_RECOVERY))
 		shutdown = true;
 	else
 		shutdown = false;
 
+	if (flags & CHECKPOINT_IS_DEMOTE)
+		demote = true;
+	else
+		demote = false;
+
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
@@ -8780,7 +8881,7 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-		ControlFile->state = DB_SHUTDOWNING;
+		ControlFile->state = demote ? DB_DEMOTING : DB_SHUTDOWNING;
 		ControlFile->time = (pg_time_t) time(NULL);
 		UpdateControlFile();
 		LWLockRelease(ControlFileLock);
@@ -8826,7 +8927,7 @@ CreateCheckPoint(int flags)
 	 * avoid inserting duplicate checkpoints when the system is idle.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
-				  CHECKPOINT_FORCE)) == 0)
+				  CHECKPOINT_IS_DEMOTE | CHECKPOINT_FORCE)) == 0)
 	{
 		if (last_important_lsn == ControlFile->checkPoint)
 		{
@@ -8994,8 +9095,8 @@ CreateCheckPoint(int flags)
 	 * allows us to reconstruct the state of running transactions during
 	 * archive recovery, if required. Skip, if this info disabled.
 	 *
-	 * If we are shutting down, or Startup process is completing crash
-	 * recovery we don't need to write running xact data.
+	 * If we are shutting down, demoting or Startup process is completing
+	 * crash recovery we don't need to write running xact data.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
@@ -9014,11 +9115,11 @@ CreateCheckPoint(int flags)
 	XLogFlush(recptr);
 
 	/*
-	 * We mustn't write any new WAL after a shutdown checkpoint, or it will be
-	 * overwritten at next startup.  No-one should even try, this just allows
-	 * sanity-checking.  In the case of an end-of-recovery checkpoint, we want
-	 * to just temporarily disable writing until the system has exited
-	 * recovery.
+	 * We mustn't write any new WAL after a shutdown or demote checkpoint, or
+	 * it will be overwritten at next startup.  No-one should even try, this
+	 * just allows sanity-checking.  In the case of an end-of-recovery
+	 * checkpoint, we want to just temporarily disable writing until the system
+	 * has exited recovery.
 	 */
 	if (shutdown)
 	{
@@ -9034,7 +9135,8 @@ CreateCheckPoint(int flags)
 	 */
 	if (shutdown && checkPoint.redo != ProcLastRecPtr)
 		ereport(PANIC,
-				(errmsg("concurrent write-ahead log activity while database system is shutting down")));
+				(errmsg("concurrent write-ahead log activity while database system is %s",
+						demote ? "demoting" : "shutting down")));
 
 	/*
 	 * Remember the prior checkpoint's redo ptr for
@@ -9047,7 +9149,7 @@ CreateCheckPoint(int flags)
 	 */
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	if (shutdown)
-		ControlFile->state = DB_SHUTDOWNED;
+		ControlFile->state = demote ? DB_DEMOTING : DB_SHUTDOWNED;
 	ControlFile->checkPoint = ProcLastRecPtr;
 	ControlFile->checkPointCopy = checkPoint;
 	ControlFile->time = (pg_time_t) time(NULL);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 624a3238b8..58473a61fd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -151,6 +151,7 @@ double		CheckPointCompletionTarget = 0.5;
  * Private state
  */
 static bool ckpt_active = false;
+static volatile sig_atomic_t demoteRequestPending = false;
 
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
@@ -552,6 +553,21 @@ HandleCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
+	if (demoteRequestPending)
+	{
+		demoteRequestPending = false;
+		/* Close down the database */
+		ShutdownXLOG(0, BoolGetDatum(true));
+		/*
+		 * Exit checkpointer. We could keep it around during demotion, but
+		 * exiting here has multiple benefits:
+		 * - to create a fresh process with clean local vars
+		 *   (eg. LocalRecoveryInProgress)
+		 * - to signal postmaster the demote shutdown checkpoint is done
+		 *   and keep going with next steps of the demotion
+		 */
+		proc_exit(0);
+	}
 	if (ShutdownRequestPending)
 	{
 		/*
@@ -680,6 +696,7 @@ CheckpointWriteDelay(int flags, double progress)
 	 * in which case we just try to catch up as quickly as possible.
 	 */
 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!demoteRequestPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
 		IsCheckpointOnSchedule(progress))
@@ -812,6 +829,17 @@ IsCheckpointOnSchedule(double progress)
  * --------------------------------
  */
 
+/* SIGUSR1: set flag to demote */
+void
+ReqCheckpointDemoteHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	demoteRequestPending = true;
+
+	errno = save_errno;
+}
+
 /* SIGINT: set flag to run a normal checkpoint right away */
 static void
 ReqCheckpointHandler(SIGNAL_ARGS)
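
The checkpointer changes above follow the usual "set a flag in the handler, act on it in the main loop" pattern. A minimal standalone sketch of that pattern is given here, assuming a plain POSIX environment: the names `demote_request_pending`, `demote_signal_handler` and `consume_demote_request` are hypothetical, and the handler is installed directly with signal() for brevity, whereas the patch routes the request through procsignal_sigusr1_handler().

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-in for the patch's demoteRequestPending flag. */
static volatile sig_atomic_t demote_request_pending = 0;

/* Handler: only set the flag, and preserve errno as the patch does. */
static void
demote_signal_handler(int signo)
{
	int			save_errno = errno;

	(void) signo;
	demote_request_pending = 1;
	errno = save_errno;
}

/*
 * Main-loop side: report whether a request arrived and clear the flag
 * so that the same request is acted on only once.
 */
static bool
consume_demote_request(void)
{
	if (!demote_request_pending)
		return false;

	demote_request_pending = 0;
	return true;
}
```

Keeping the handler down to a flag assignment avoids calling anything that is not async-signal-safe; the real work (the demote checkpoint, in the patch) then runs from the normal interrupt-processing path.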
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 73ce944fb1..e75d33d335 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3854,6 +3854,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_PROMOTE:
 			event_name = "Promote";
 			break;
+		case WAIT_EVENT_DEMOTE:
+			event_name = "Demote";
+			break;
 		case WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT:
 			event_name = "RecoveryConflictSnapshot";
 			break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 42223c0f61..231febaf2f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -277,12 +277,13 @@ static StartupStatusEnum StartupStatus = STARTUP_NOT_RUNNING;
 #define			ImmediateShutdown	3
 
 static int	Shutdown = NoShutdown;
+static bool DemoteSignal = false; /* true on demote request */
 
 static bool FatalError = false; /* T if recovering from backend crash */
 
 /*
- * We use a simple state machine to control startup, shutdown, and
- * crash recovery (which is rather like shutdown followed by startup).
+ * We use a simple state machine to control startup, shutdown, demote and
+ * crash recovery (both are rather like shutdown followed by startup).
  *
  * After doing all the postmaster initialization work, we enter PM_STARTUP
  * state and the startup process is launched. The startup process begins by
@@ -325,6 +326,7 @@ typedef enum
 {
 	PM_INIT,					/* postmaster starting */
 	PM_STARTUP,					/* waiting for startup subprocess */
+	PM_DEMOTING,				/* waiting for idle or RO backends for demote */
 	PM_RECOVERY,				/* in archive recovery mode */
 	PM_HOT_STANDBY,				/* in hot standby mode */
 	PM_RUN,						/* normal "database is alive" state */
@@ -429,10 +431,14 @@ static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
 static bool SignalSomeChildren(int signal, int targets);
 static void TerminateChildren(int signal);
+static void RemoveDemoteSignalFiles(void);
+static bool CheckDemoteSignal(void);
+
 
 #define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
 
 static int	CountChildren(int target);
+static int	CountXacts(void);
 static bool assign_backendlist_entry(RegisteredBgWorker *rw);
 static void maybe_start_bgworkers(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
@@ -2319,6 +2325,11 @@ retry1:
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
 					 errmsg("the database system is starting up")));
 			break;
+		case CAC_DEMOTE:
+			ereport(FATAL,
+					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+					 errmsg("the database system is demoting")));
+			break;
 		case CAC_SHUTDOWN:
 			ereport(FATAL,
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
@@ -2450,16 +2461,19 @@ canAcceptConnections(int backend_type)
 	CAC_state	result = CAC_OK;
 
 	/*
-	 * Can't start backends when in startup/shutdown/inconsistent recovery
-	 * state.  We treat autovac workers the same as user backends for this
-	 * purpose.  However, bgworkers are excluded from this test; we expect
-	 * bgworker_should_start_now() decided whether the DB state allows them.
+	 * Can't start backends when in startup/demote/shutdown/inconsistent
+	 * recovery state.  We treat autovac workers the same as user backends
+	 * for this purpose.  However, bgworkers are excluded from this test; we
+	 * expect bgworker_should_start_now() decided whether the DB state allows
+	 * them.
 	 */
 	if (pmState != PM_RUN && pmState != PM_HOT_STANDBY &&
 		backend_type != BACKEND_TYPE_BGWORKER)
 	{
 		if (Shutdown > NoShutdown)
 			return CAC_SHUTDOWN;	/* shutdown is pending */
+		else if (DemoteSignal)
+			return CAC_DEMOTE;	/* demote is pending */
 		else if (!FatalError &&
 				 (pmState == PM_STARTUP ||
 				  pmState == PM_RECOVERY))
@@ -3091,7 +3105,18 @@ reaper(SIGNAL_ARGS)
 		if (pid == CheckpointerPID)
 		{
 			CheckpointerPID = 0;
-			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
+			if (EXIT_STATUS_0(exitstatus) &&
+					 DemoteSignal &&
+					 pmState == PM_SHUTDOWN)
+			{
+				/*
+				 * The checkpointer exit signals the demote shutdown checkpoint
+				 * is done. The startup recovery mode can be started from there.
+				 */
+				ereport(DEBUG1,
+						(errmsg_internal("checkpointer shutdown for demote")));
+			}
+			else if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
 			{
 				/*
 				 * OK, we saw normal exit of the checkpointer after it's been
@@ -3799,6 +3824,25 @@ PostmasterStateMachine(void)
 		}
 	}
 
+	if (pmState == PM_DEMOTING)
+	{
+		int numXacts = CountXacts();
+
+		/*
+		 * PM_DEMOTING state ends when we have no active transactions
+		 * and all backends set LocalXLogInsertAllowed=0
+		 */
+		if (numXacts == 0)
+		{
+			ereport(LOG, (errmsg("all backends in read only")));
+
+			SendProcSignal(CheckpointerPID, PROCSIG_CHECKPOINTER_DEMOTING, InvalidBackendId);
+			pmState = PM_SHUTDOWN;
+		}
+		else
+			ereport(LOG, (errmsg("waiting for %d transactions to finish", numXacts)));
+	}
+
 	/*
 	 * If we're ready to do so, signal child processes to shut down.  (This
 	 * isn't a persistent state, but treating it as a distinct pmState allows
@@ -4002,6 +4046,20 @@ PostmasterStateMachine(void)
 		(StartupStatus == STARTUP_CRASHED || !restart_after_crash))
 		ExitPostmaster(1);
 
+
+	/* Demoting: start the Startup Process */
+	if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0)
+	{
+		/* stop archiver process if not required during standby */
+		if (!XLogArchivingAlways() && PgArchPID != 0)
+			signal_child(PgArchPID, SIGQUIT);
+
+		StartupPID = StartupDataBase();
+		Assert(StartupPID != 0);
+		StartupStatus = STARTUP_RUNNING;
+		pmState = PM_STARTUP;
+	}
+
 	/*
 	 * If we need to recover from a crash, wait for all non-syslogger children
 	 * to exit, then reset shmem and StartupDataBase.
@@ -5212,8 +5270,12 @@ sigusr1_handler(SIGNAL_ARGS)
 		 * Crank up the background tasks.  It doesn't matter if this fails,
 		 * we'll just try again later.
 		 */
+		if (!DemoteSignal)
+			Assert(PgArchPID == 0);
+
 		Assert(CheckpointerPID == 0);
 		CheckpointerPID = StartCheckpointer();
+
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
 
@@ -5221,8 +5283,7 @@ sigusr1_handler(SIGNAL_ARGS)
 		 * Start the archiver if we're responsible for (re-)archiving received
 		 * files.
 		 */
-		Assert(PgArchPID == 0);
-		if (XLogArchivingAlways())
+		if (PgArchPID == 0 && XLogArchivingAlways())
 			PgArchPID = pgarch_start();
 
 		/*
@@ -5233,6 +5294,7 @@ sigusr1_handler(SIGNAL_ARGS)
 		if (!EnableHotStandby)
 		{
 			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STANDBY);
+			DemoteSignal = false;
 #ifdef USE_SYSTEMD
 			sd_notify(0, "READY=1");
 #endif
@@ -5243,11 +5305,15 @@ sigusr1_handler(SIGNAL_ARGS)
 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
 		pmState == PM_RECOVERY && Shutdown == NoShutdown)
 	{
+		dlist_iter	iter;
+
 		/*
 		 * Likewise, start other special children as needed.
 		 */
-		Assert(PgStatPID == 0);
-		PgStatPID = pgstat_start();
+		if (!DemoteSignal)
+			Assert(PgStatPID == 0);
+		if (PgStatPID == 0)
+			PgStatPID = pgstat_start();
 
 		ereport(LOG,
 				(errmsg("database system is ready to accept read only connections")));
@@ -5258,8 +5324,18 @@ sigusr1_handler(SIGNAL_ARGS)
 		sd_notify(0, "READY=1");
 #endif
 
+		if (DemoteSignal)
+			dlist_foreach(iter, &BackendList)
+			{
+				Backend    *bp = dlist_container(Backend, elem, iter.cur);
+
+				if (!bp->dead_end && bp->bkend_type & (BACKEND_TYPE_NORMAL|BACKEND_TYPE_WALSND))
+					SendProcSignal(bp->pid, PROCSIG_DEMOTED, InvalidBackendId);
+			}
+
 		pmState = PM_HOT_STANDBY;
 		connsAllowed = ALLOW_ALL_CONNS;
+		DemoteSignal = false;
 
 		/* Some workers may be scheduled to start now */
 		StartWorkerNeeded = true;
@@ -5351,6 +5427,97 @@ sigusr1_handler(SIGNAL_ARGS)
 		signal_child(StartupPID, SIGUSR2);
 	}
 
+	if (CheckDemoteSignal() && pmState != PM_RUN)
+	{
+		DemoteSignal = false;
+		RemoveDemoteSignalFiles();
+		ereport(LOG,
+				(errmsg("ignoring demote signal because already in standby mode")));
+	}
+	/* received demote signal */
+	else if (CheckDemoteSignal())
+	{
+		FILE	   *standby_file;
+		dlist_iter	iter;
+		bool fast_demote;
+		struct stat stat_buf;
+
+		fast_demote = (stat(DEMOTE_FAST_SIGNAL_FILE, &stat_buf) == 0);
+
+		DemoteSignal = true;
+		RemoveDemoteSignalFiles();
+
+		/* create the standby signal file */
+		standby_file = AllocateFile(STANDBY_SIGNAL_FILE, "w");
+		if (!standby_file)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create file \"%s\": %m",
+							STANDBY_SIGNAL_FILE)));
+			goto out;
+		}
+
+		if (FreeFile(standby_file))
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write file \"%s\": %m",
+							STANDBY_SIGNAL_FILE)));
+			unlink(STANDBY_SIGNAL_FILE);
+			goto out;
+		}
+
+		if (!fast_demote)
+		{
+			/* smart demote */
+			ereport(LOG, (errmsg("received smart demote request")));
+
+		}
+		else
+		{
+			/* fast demote */
+			ereport(LOG, (errmsg("received fast demote request")));
+		}
+
+		SignalSomeChildren(SIGTERM,
+						   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
+
+		/* and the autovac launcher too */
+		if (AutoVacPID != 0)
+			signal_child(AutoVacPID, SIGTERM);
+		/* and the bgwriter too */
+		if (BgWriterPID != 0)
+			signal_child(BgWriterPID, SIGTERM);
+		/* and the walwriter too */
+		if (WalWriterPID != 0)
+			signal_child(WalWriterPID, SIGTERM);
+
+		dlist_foreach(iter, &BackendList)
+		{
+			Backend    *bp = dlist_container(Backend, elem, iter.cur);
+
+			if (bp->dead_end)
+				continue;
+			/*
+			 * Signal regular backends only, skipping any that have recently
+			 * announced themselves as WAL senders.
+			 */
+			if (bp->bkend_type == BACKEND_TYPE_NORMAL &&
+				! IsPostmasterChildWalSender(bp->child_slot))
+				SendProcSignal(bp->pid,
+							   (fast_demote ? PROCSIG_DEMOTING_FAST : PROCSIG_DEMOTING),
+							   InvalidBackendId);
+		}
+
+		pmState = PM_DEMOTING;
+
+		/* Report status */
+		AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_DEMOTING);
+	}
+
+out:
+
 #ifdef WIN32
 	PG_SETMASK(&UnBlockSig);
 #endif
@@ -5448,6 +5615,26 @@ CountChildren(int target)
 }
 
 
+/*
+ * Count up the number of active transactions
+ */
+static int
+CountXacts(void)
+{
+	int			i;
+	int			cnt = 0;
+
+	for (i = 0; i < ProcGlobal->allProcCount; ++i)
+	{
+		PGPROC   *proc = &ProcGlobal->allProcs[i];
+		if (TransactionIdIsValid(proc->xid))
+			cnt++;
+	}
+
+	return cnt;
+}
+
+
 /*
  * StartChildProcess -- start an auxiliary process for the postmaster
  *
@@ -5912,6 +6099,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 		case PM_SHUTDOWN:
 		case PM_WAIT_BACKENDS:
 		case PM_STOP_BACKENDS:
+		case PM_DEMOTING:
 			break;
 
 		case PM_RUN:
@@ -6660,3 +6848,28 @@ InitPostmasterDeathWatchHandle(void)
 								 GetLastError())));
 #endif							/* WIN32 */
 }
+
+/*
+ * Remove the files signaling a demote request.
+ */
+static void
+RemoveDemoteSignalFiles(void)
+{
+	unlink(DEMOTE_SIGNAL_FILE);
+	unlink(DEMOTE_FAST_SIGNAL_FILE);
+}
+
+/*
+ * Check if a demote request appeared.
+ */
+static bool
+CheckDemoteSignal(void)
+{
+	struct stat stat_buf;
+
+	if (stat(DEMOTE_SIGNAL_FILE, &stat_buf) == 0 ||
+		stat(DEMOTE_FAST_SIGNAL_FILE, &stat_buf) == 0)
+		return true;
+
+	return false;
+}
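
The postmaster code above decides between a smart and a fast demote purely from which signal file exists in the data directory, with the fast file taking precedence. A minimal standalone sketch of that decision follows. The file names are the ones the patch defines (DEMOTE_SIGNAL_FILE and DEMOTE_FAST_SIGNAL_FILE); the `DemoteMode` enum and `demote_mode_from_files` function are hypothetical names introduced only for illustration, and the data directory is passed explicitly instead of relying on the postmaster's working directory.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical result type, for illustration only. */
typedef enum DemoteMode
{
	DEMOTE_NONE,				/* no signal file present */
	DEMOTE_SMART,				/* "demote" file present */
	DEMOTE_FAST					/* "demote_fast" file present */
} DemoteMode;

/*
 * Look for the demote signal files under datadir.  As in the patch, the
 * presence of the fast file wins if both files exist.
 */
static DemoteMode
demote_mode_from_files(const char *datadir)
{
	char		path[1024];
	struct stat st;

	assert(datadir != NULL);

	snprintf(path, sizeof(path), "%s/demote_fast", datadir);
	if (stat(path, &st) == 0)
		return DEMOTE_FAST;

	snprintf(path, sizeof(path), "%s/demote", datadir);
	if (stat(path, &st) == 0)
		return DEMOTE_SMART;

	return DEMOTE_NONE;
}
```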
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 96e4a87857..83d7e05944 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -349,6 +349,8 @@ ProcArrayShmemSize(void)
 	size = add_size(size, mul_size(sizeof(int), PROCARRAY_MAXPROCS));
 
 	/*
+	 * FIXME demote: check safe hotStandby related init and snapshot mech.
+	 *
 	 * During Hot Standby processing we have a data structure called
 	 * KnownAssignedXids, created in shared memory. Local data structures are
 	 * also created in various backends during GetSnapshotData(),
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4fa385b0ec..ac14c662d3 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -28,6 +28,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "postmaster/bgwriter.h"
 
 /*
  * The SIGUSR1 signal is multiplexed to support signaling multiple event
@@ -585,6 +586,35 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	/* signal the checkpointer process to initiate the demote procedure */
+	if (CheckProcSignal(PROCSIG_CHECKPOINTER_DEMOTING))
+		ReqCheckpointDemoteHandler(PROCSIG_CHECKPOINTER_DEMOTING);
+
+	/*
+	 * Ask backends to become read-only by setting
+	 * LocalXLogInsertAllowed = 0 as soon as their active xact
+	 * has finished.
+	 */
+	if (CheckProcSignal(PROCSIG_DEMOTING))
+		ReqDemoteHandler(PROCSIG_DEMOTING);
+
+	/*
+	 * Ask backends to become read-only by setting
+	 * LocalXLogInsertAllowed = 0 if they are idle, or to
+	 * interrupt their current xact and terminate.
+	 */
+	if (CheckProcSignal(PROCSIG_DEMOTING_FAST))
+		ReqDemoteHandler(PROCSIG_DEMOTING_FAST);
+
+	/*
+	 * Demote complete. Ask backends to rely on the
+	 * recovery status for LocalXLogInsertAllowed by
+	 * setting it to -1.
+	 * WAL senders set am_cascading_walsender.
+	 */
+	if (CheckProcSignal(PROCSIG_DEMOTED))
+		ReqDemotedHandler(PROCSIG_DEMOTED);
+
 	SetLatch(MyLatch);
 
 	latch_sigusr1_handler();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index d86566f455..5843db0991 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4370,6 +4370,18 @@ lock_twophase_postabort(TransactionId xid, uint16 info,
 	lock_twophase_postcommit(xid, info, recdata, len);
 }
 
+/*
+ * 2PC shutdown from lock table.
+ *
+ * This is actually just the same as the COMMIT case.
+ */
+void
+lock_twophase_shutdown(TransactionId xid, uint16 info,
+						void *recdata, uint32 len)
+{
+	lock_twophase_postcommit(xid, info, recdata, len);
+}
+
 /*
  *		VirtualXactLockTableInsert
  *
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index c9424f167c..b44e6f5876 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -67,6 +67,7 @@
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinval.h"
@@ -3211,6 +3212,42 @@ ProcessInterrupts(void)
 		HandleParallelMessages();
 }
 
+/* SIGUSR1: set flag to demote */
+void
+ReqDemoteHandler(ProcSignalReason reason)
+{
+	if (MyBackendType != B_BACKEND)
+		return;
+
+	if (TransactionIdIsValid(MyProc->xid))
+	{
+		if (reason == PROCSIG_DEMOTING_FAST)
+		{
+			InterruptPending = true;
+			ProcDiePending = true;
+			SetLatch(MyLatch);
+		}
+		else
+			DemotePending = true;
+	}
+	else
+		LocalSetXLogInsertNotAllowed();
+}
+
+/* SIGUSR1: reset LocalRecoveryInProgress */
+void
+ReqDemotedHandler(ProcSignalReason reason)
+{
+	ereport(LOG,
+				(errmsg("received demote complete signal")));
+
+	SetLocalRecoveryInProgress();
+	LocalSetXLogInsertCheckRecovery();
+
+	if (MyBackendType == B_WAL_SENDER)
+		am_cascading_walsender = true;
+}
+
 
 /*
  * IA64-specific code to fetch the AR.BSP register for stack depth checks.
@@ -4224,6 +4261,12 @@ PostgresMain(int argc, char *argv[],
 				/* Send out notify signals and transmit self-notifies */
 				ProcessCompletedNotifies();
 
+				if (DemotePending) {
+					LocalSetXLogInsertNotAllowed();
+					DemotePending = false;
+					SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
+				}
+
 				/*
 				 * Also process incoming notifies, if any.  This is mostly to
 				 * ensure stable behavior in tests: if any notifies were
@@ -4285,6 +4328,7 @@ PostgresMain(int argc, char *argv[],
 		{
 			ConfigReloadPending = false;
 			ProcessConfigFile(PGC_SIGHUP);
+			SetLocalRecoveryInProgress();
 		}
 
 		/*
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..021f6af434 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t DemotePending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f70..9ef133f79a 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -57,6 +57,8 @@ dbState(DBState state)
 			return _("shut down");
 		case DB_SHUTDOWNED_IN_RECOVERY:
 			return _("shut down in recovery");
+		case DB_DEMOTING:
+			return _("demoting");
 		case DB_SHUTDOWNING:
 			return _("shutting down");
 		case DB_IN_CRASH_RECOVERY:
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 1cdc3ebaa3..a7805bd219 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -62,6 +62,7 @@ typedef enum
 	RESTART_COMMAND,
 	RELOAD_COMMAND,
 	STATUS_COMMAND,
+	DEMOTE_COMMAND,
 	PROMOTE_COMMAND,
 	LOGROTATE_COMMAND,
 	KILL_COMMAND,
@@ -103,6 +104,7 @@ static char version_file[MAXPGPATH];
 static char pid_file[MAXPGPATH];
 static char backup_file[MAXPGPATH];
 static char promote_file[MAXPGPATH];
+static char demote_file[MAXPGPATH];
 static char logrotate_file[MAXPGPATH];
 
 static volatile pgpid_t postmasterPID = -1;
@@ -129,6 +131,7 @@ static void do_stop(void);
 static void do_restart(void);
 static void do_reload(void);
 static void do_status(void);
+static void do_demote(void);
 static void do_promote(void);
 static void do_logrotate(void);
 static void do_kill(pgpid_t pid);
@@ -1029,6 +1032,115 @@ do_stop(void)
 }
 
 
+static void
+do_demote(void)
+{
+	int			cnt;
+	FILE	   *dmtfile;
+	pgpid_t		pid;
+	struct stat statbuf;
+
+	pid = get_pgpid(false);
+
+	if (pid == 0)				/* no pid file */
+	{
+		write_stderr(_("%s: PID file \"%s\" does not exist\n"), progname, pid_file);
+		write_stderr(_("Is server running?\n"));
+		exit(1);
+	}
+	else if (pid < 0)			/* standalone backend, not postmaster */
+	{
+		pid = -pid;
+		write_stderr(_("%s: cannot demote server; "
+					   "single-user server is running (PID: %ld)\n"),
+					 progname, pid);
+		exit(1);
+	}
+
+	if (shutdown_mode == IMMEDIATE_MODE)
+	{
+		write_stderr(_("%s: cannot demote server using immediate mode\n"),
+					 progname);
+		exit(1);
+	}
+	else if (shutdown_mode == FAST_MODE)
+		snprintf(demote_file, MAXPGPATH, "%s/demote_fast", pg_data);
+	else
+		snprintf(demote_file, MAXPGPATH, "%s/demote", pg_data);
+
+	if ((dmtfile = fopen(demote_file, "w")) == NULL)
+	{
+		write_stderr(_("%s: could not create demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+
+	if (fclose(dmtfile))
+	{
+		write_stderr(_("%s: could not write demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+
+	sig = SIGUSR1;
+	if (kill((pid_t) pid, sig) != 0)
+	{
+		write_stderr(_("%s: could not send demote signal (PID: %ld): %s\n"), progname, pid,
+					 strerror(errno));
+		exit(1);
+	}
+
+	if (!do_wait)
+	{
+		print_msg(_("server demoting\n"));
+		return;
+	}
+	else
+	{
+		/*
+		 * FIXME demote
+		 * If backup_label exists, an online backup is running. Warn the user
+		 * that smart demote will wait for it to finish. However, if the
+		 * server is in archive recovery, we're recovering from an online
+		 * backup instead of performing one.
+		 */
+		if (shutdown_mode == SMART_MODE &&
+			stat(backup_file, &statbuf) == 0 &&
+			get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		{
+			print_msg(_("WARNING: online backup mode is active\n"
+						"Demote will not complete until pg_stop_backup() is called.\n\n"));
+		}
+
+		print_msg(_("waiting for server to demote..."));
+
+		for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
+		{
+			if (get_control_dbstate() == DB_IN_ARCHIVE_RECOVERY)
+				break;
+
+			if (cnt % WAITS_PER_SEC == 0)
+				print_msg(".");
+			pg_usleep(USEC_PER_SEC / WAITS_PER_SEC);
+		}
+
+		if (get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		{
+			print_msg(_(" failed\n"));
+
+			write_stderr(_("%s: server does not demote\n"), progname);
+			if (shutdown_mode == SMART_MODE)
+				write_stderr(_("HINT: The \"-m fast\" option immediately disconnects sessions rather than\n"
+							   "waiting for session-initiated disconnection.\n"));
+			exit(1);
+		}
+		print_msg(_(" done\n"));
+
+		print_msg(_("server demoted\n"));
+	}
+}
+
+
 /*
  *	restart/reload routines
  */
@@ -2447,6 +2559,8 @@ main(int argc, char **argv)
 				ctl_command = RELOAD_COMMAND;
 			else if (strcmp(argv[optind], "status") == 0)
 				ctl_command = STATUS_COMMAND;
+			else if (strcmp(argv[optind], "demote") == 0)
+				ctl_command = DEMOTE_COMMAND;
 			else if (strcmp(argv[optind], "promote") == 0)
 				ctl_command = PROMOTE_COMMAND;
 			else if (strcmp(argv[optind], "logrotate") == 0)
@@ -2554,6 +2668,9 @@ main(int argc, char **argv)
 		case RELOAD_COMMAND:
 			do_reload();
 			break;
+		case DEMOTE_COMMAND:
+			do_demote();
+			break;
 		case PROMOTE_COMMAND:
 			do_promote();
 			break;
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3445..4b56f92181 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -53,6 +53,7 @@ extern void RecoverPreparedTransactions(void);
 extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
 
 extern void FinishPreparedTransaction(const char *gid, bool isCommit);
+extern void ShutdownPreparedTransactions(void);
 
 extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8c9cadc6da..b1b1ea67f9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -219,18 +219,20 @@ extern bool XLOG_DEBUG;
 
 /* These directly affect the behavior of CreateCheckPoint and subsidiaries */
 #define CHECKPOINT_IS_SHUTDOWN	0x0001	/* Checkpoint is for shutdown */
-#define CHECKPOINT_END_OF_RECOVERY	0x0002	/* Like shutdown checkpoint, but
+#define CHECKPOINT_IS_DEMOTE	0x0002	/* Like shutdown checkpoint, but
+											 * issued at end of WAL production */
+#define CHECKPOINT_END_OF_RECOVERY	0x0004	/* Like shutdown checkpoint, but
 											 * issued at end of WAL recovery */
-#define CHECKPOINT_IMMEDIATE	0x0004	/* Do it without delays */
-#define CHECKPOINT_FORCE		0x0008	/* Force even if no activity */
-#define CHECKPOINT_FLUSH_ALL	0x0010	/* Flush all pages, including those
+#define CHECKPOINT_IMMEDIATE	0x0008	/* Do it without delays */
+#define CHECKPOINT_FORCE		0x0010	/* Force even if no activity */
+#define CHECKPOINT_FLUSH_ALL	0x0020	/* Flush all pages, including those
 										 * belonging to unlogged tables */
 /* These are important to RequestCheckpoint */
-#define CHECKPOINT_WAIT			0x0020	/* Wait for completion */
-#define CHECKPOINT_REQUESTED	0x0040	/* Checkpoint request has been made */
+#define CHECKPOINT_WAIT			0x0040	/* Wait for completion */
+#define CHECKPOINT_REQUESTED	0x0080	/* Checkpoint request has been made */
 /* These indicate the cause of a checkpoint request */
-#define CHECKPOINT_CAUSE_XLOG	0x0080	/* XLOG consumption */
-#define CHECKPOINT_CAUSE_TIME	0x0100	/* Elapsed time */
+#define CHECKPOINT_CAUSE_XLOG	0x0100	/* XLOG consumption */
+#define CHECKPOINT_CAUSE_TIME	0x0200	/* Elapsed time */
 
 /*
  * Flag bits for the record being inserted, set using XLogSetRecordFlags().
@@ -301,6 +303,7 @@ extern const char *xlog_identify(uint8 info);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
+extern bool SetLocalRecoveryInProgress(void);
 extern bool RecoveryInProgress(void);
 extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
@@ -397,4 +400,8 @@ extern SessionBackupState get_backup_status(void);
 /* files to signal promotion to primary */
 #define PROMOTE_SIGNAL_FILE		"promote"
 
+/* files to signal demotion to standby */
+#define DEMOTE_SIGNAL_FILE		"demote"
+#define DEMOTE_FAST_SIGNAL_FILE	"demote_fast"
+
 #endif							/* XLOG_H */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 06bed90c5e..3b30c1d767 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -87,6 +87,7 @@ typedef enum DBState
 	DB_STARTUP = 0,
 	DB_SHUTDOWNED,
 	DB_SHUTDOWNED_IN_RECOVERY,
+	DB_DEMOTING,
 	DB_SHUTDOWNING,
 	DB_IN_CRASH_RECOVERY,
 	DB_IN_ARCHIVE_RECOVERY,
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 0a23281ad5..24ba3da013 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -70,7 +70,7 @@ typedef struct
 
 typedef enum CAC_state
 {
-	CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
+	CAC_OK, CAC_STARTUP, CAC_DEMOTE, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
 	CAC_SUPERUSER
 } CAC_state;
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..d60804208f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,7 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t DemotePending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..f1c0a37e76 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -880,6 +880,7 @@ typedef enum
 	WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
 	WAIT_EVENT_PROC_SIGNAL_BARRIER,
 	WAIT_EVENT_PROMOTE,
+	WAIT_EVENT_DEMOTE,
 	WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
 	WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE,
 	WAIT_EVENT_RECOVERY_PAUSE,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e..4d4f0ea1dd 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -41,5 +41,6 @@ extern Size CheckpointerShmemSize(void);
 extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
+extern void ReqCheckpointDemoteHandler(SIGNAL_ARGS);
 
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 1c3e9c1999..fa8d64e1a7 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -584,6 +584,8 @@ extern void lock_twophase_postcommit(TransactionId xid, uint16 info,
 									 void *recdata, uint32 len);
 extern void lock_twophase_postabort(TransactionId xid, uint16 info,
 									void *recdata, uint32 len);
+extern void lock_twophase_shutdown(TransactionId xid, uint16 info,
+									void *recdata, uint32 len);
 extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
 										  void *recdata, uint32 len);
 
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f3..7264e9a705 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -34,6 +34,10 @@ typedef enum
 	PROCSIG_PARALLEL_MESSAGE,	/* message from cooperating parallel backend */
 	PROCSIG_WALSND_INIT_STOPPING,	/* ask walsenders to prepare for shutdown  */
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
+	PROCSIG_DEMOTING,			/* ask backends to demote in smart mode */
+	PROCSIG_DEMOTING_FAST,		/* ask backends to demote in fast mode */
+	PROCSIG_DEMOTED,			/* ask backends to switch to recovery mode */
+	PROCSIG_CHECKPOINTER_DEMOTING,	/* ask checkpointer to demote */
 
 	/* Recovery conflict reasons */
 	PROCSIG_RECOVERY_CONFLICT_DATABASE,
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index bd30607b07..e5f42f9fec 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -68,6 +68,8 @@ extern void StatementCancelHandler(SIGNAL_ARGS);
 extern void FloatExceptionHandler(SIGNAL_ARGS) pg_attribute_noreturn();
 extern void RecoveryConflictInterrupt(ProcSignalReason reason); /* called from SIGUSR1
 																 * handler */
+extern void ReqDemoteHandler(ProcSignalReason reason); /* called from SIGUSR1 handler */
+extern void ReqDemotedHandler(ProcSignalReason reason); /* called from SIGUSR1 handler */
 extern void ProcessClientReadInterrupt(bool blocked);
 extern void ProcessClientWriteInterrupt(bool blocked);
 
diff --git a/src/include/utils/pidfile.h b/src/include/utils/pidfile.h
index 63fefe5c4c..f761d2c4ef 100644
--- a/src/include/utils/pidfile.h
+++ b/src/include/utils/pidfile.h
@@ -50,6 +50,7 @@
  */
 #define PM_STATUS_STARTING		"starting"	/* still starting up */
 #define PM_STATUS_STOPPING		"stopping"	/* in shutdown sequence */
+#define PM_STATUS_DEMOTING		"demoting"	/* demote sequence */
 #define PM_STATUS_READY			"ready   "	/* ready for connections */
 #define PM_STATUS_STANDBY		"standby "	/* up, won't accept connections */
 
-- 
2.20.1

v5-0003-demote-add-pg_demote-function.patch (text/x-patch)
From 4b1f49b6c9fc099b85fb8a561a7035dde715b211 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Fri, 31 Jul 2020 18:07:38 +0200
Subject: [PATCH 3/4] demote: add pg_demote() function

---
 src/backend/access/transam/xlogfuncs.c | 94 ++++++++++++++++++++++++++
 src/backend/catalog/system_views.sql   |  6 ++
 src/include/catalog/pg_proc.dat        |  4 ++
 3 files changed, 104 insertions(+)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 290658b22c..733f465d38 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -784,3 +784,97 @@ pg_promote(PG_FUNCTION_ARGS)
 			(errmsg("server did not promote within %d seconds", wait_seconds)));
 	PG_RETURN_BOOL(false);
 }
+
+/*
+ * Demotes a production server.
+ *
+ * A result of "true" means that demotion has been completed if "wait" is
+ * "true", or initiated if "wait" is false.
+ */
+Datum
+pg_demote(PG_FUNCTION_ARGS)
+{
+	bool		fast = PG_GETARG_BOOL(0);
+	bool		wait = PG_GETARG_BOOL(1);
+	int			wait_seconds = PG_GETARG_INT32(2);
+	char		demote_filename[] = "demote_fast";
+	FILE	   *demote_file;
+	int			i;
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery in progress"),
+				 errhint("You cannot demote while already in recovery.")));
+
+	if (!EnableHotStandby)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("function pg_demote() requires hot_standby parameter to be enabled"),
+				 errhint("The function cannot return its status from a non hot_standby-enabled standby.")));
+
+	if (wait_seconds <= 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"wait_seconds\" must not be negative or zero")));
+
+	if (!fast)
+		demote_filename[6] = '\0';
+
+	/* create the demote signal file */
+	demote_file = AllocateFile(demote_filename, "w");
+	if (!demote_file)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						demote_filename)));
+
+	if (FreeFile(demote_file))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write file \"%s\": %m",
+						demote_filename)));
+
+	/* signal the postmaster */
+	if (kill(PostmasterPid, SIGUSR1) != 0)
+	{
+		ereport(WARNING,
+				(errmsg("failed to send signal to postmaster: %m")));
+		(void) unlink(demote_filename);
+		PG_RETURN_BOOL(false);
+	}
+
+	/* return immediately if waiting was not requested */
+	if (!wait)
+		PG_RETURN_BOOL(true);
+
+	/* wait for the amount of time wanted until demotion */
+#define WAITS_PER_SECOND 10
+	for (i = 0; i < WAITS_PER_SECOND * wait_seconds; i++)
+	{
+		int			rc;
+
+		ResetLatch(MyLatch);
+
+		if (RecoveryInProgress())
+			PG_RETURN_BOOL(true);
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   1000L / WAITS_PER_SECOND,
+					   WAIT_EVENT_DEMOTE);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			PG_RETURN_BOOL(false);
+	}
+
+	ereport(WARNING,
+			(errmsg("server did not demote within %d seconds", wait_seconds)));
+	PG_RETURN_BOOL(false);
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8625cbeab6..573d7b46eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1219,6 +1219,11 @@ CREATE OR REPLACE FUNCTION
   RETURNS boolean STRICT VOLATILE LANGUAGE INTERNAL AS 'pg_promote'
   PARALLEL SAFE;
 
+CREATE OR REPLACE FUNCTION
+  pg_demote(fast boolean DEFAULT true, wait boolean DEFAULT true, wait_seconds integer DEFAULT 60)
+  RETURNS boolean STRICT VOLATILE LANGUAGE INTERNAL AS 'pg_demote'
+  PARALLEL SAFE;
+
 -- legacy definition for compatibility with 9.3
 CREATE OR REPLACE FUNCTION
   json_populate_record(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
@@ -1435,6 +1440,7 @@ REVOKE EXECUTE ON FUNCTION pg_reload_conf() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_current_logfile() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_current_logfile(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_demote(boolean, boolean, integer) FROM public;
 
 REVOKE EXECUTE ON FUNCTION pg_stat_reset() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_stat_reset_shared(text) FROM public;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 082a11f270..9e4d000d00 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6084,6 +6084,10 @@
   proname => 'pg_promote', provolatile => 'v', prorettype => 'bool',
   proargtypes => 'bool int4', proargnames => '{wait,wait_seconds}',
   prosrc => 'pg_promote' },
+{ oid => '8967', descr => 'demote production server',
+  proname => 'pg_demote', provolatile => 'v', prorettype => 'bool',
+  proargtypes => 'bool bool int4', proargnames => '{fast,wait,wait_seconds}',
+  prosrc => 'pg_demote' },
 { oid => '2848', descr => 'switch to new wal file',
   proname => 'pg_switch_wal', provolatile => 'v', prorettype => 'pg_lsn',
   proargtypes => '', prosrc => 'pg_switch_wal' },
-- 
2.20.1

v5-0004-demote-add-various-tests-related-to-demote-and-promo.patch (text/x-patch)
From f47d4aca2ff2c938c33b8cabf0538e6f66c06d2b Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Fri, 10 Jul 2020 02:00:38 +0200
Subject: [PATCH 4/4] demote: add various tests related to demote and promote
 actions

* demote/promote with a standby replicating from the node
* make sure 2PC survive a demote/promote cycle
* commit 2PC and check the result
* swap roles between primary and standby
* make sure wal sender enters cascade mode
* commit a 2PC on the new primary
* confirm behavior of backends during smart/fast demote
---
 src/test/perl/PostgresNode.pm             |  25 ++
 src/test/recovery/t/021_promote-demote.pl | 287 ++++++++++++++++++++++
 2 files changed, 312 insertions(+)
 create mode 100644 src/test/recovery/t/021_promote-demote.pl

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1488bffa2b..0f3d40088c 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -906,6 +906,31 @@ sub promote
 
 =pod
 
+=item $node->demote()
+
+Wrapper for pg_ctl demote
+
+=cut
+
+sub demote
+{
+	my ($self, $mode) = @_;
+	my $port    = $self->port;
+	my $pgdata  = $self->data_dir;
+	my $logfile = $self->logfile;
+	my $name    = $self->name;
+
+	$mode = 'fast' unless defined $mode;
+
+	print "### Demoting node \"$name\" using mode $mode\n";
+
+	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-l', $logfile,
+		'-m', $mode, 'demote');
+	return;
+}
+
+=pod
+
 =item $node->logrotate()
 
 Wrapper for pg_ctl logrotate
diff --git a/src/test/recovery/t/021_promote-demote.pl b/src/test/recovery/t/021_promote-demote.pl
new file mode 100644
index 0000000000..245acfb211
--- /dev/null
+++ b/src/test/recovery/t/021_promote-demote.pl
@@ -0,0 +1,287 @@
+# Test demote/promote actions in various scenarios using three
+# nodes alpha, beta and gamma. We check proper actions results,
+# correct data replication and cascade across multiple
+# demote/promote, manual switchover, smart and fast demote.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize node alpha
+my $node_alpha = get_new_node('alpha');
+$node_alpha->init(allows_streaming => 1);
+$node_alpha->append_conf(
+	'postgresql.conf', qq(
+	max_prepared_transactions = 10
+));
+
+# Take backup
+my $backup_name = 'alpha_backup';
+$node_alpha->start;
+$node_alpha->backup($backup_name);
+
+# Create node beta from backup
+my $node_beta = get_new_node('beta');
+$node_beta->init_from_backup($node_alpha, $backup_name);
+$node_beta->enable_streaming($node_alpha);
+$node_beta->start;
+
+# Create node gamma from backup
+my $node_gamma = get_new_node('gamma');
+$node_gamma->init_from_backup($node_alpha, $backup_name);
+$node_gamma->enable_streaming($node_alpha);
+$node_gamma->start;
+
+# Create some 2PC on alpha for future tests
+$node_alpha->safe_psql('postgres', q{
+CREATE TABLE ins AS SELECT 1 AS i;
+BEGIN;
+CREATE TABLE new AS SELECT generate_series(1,5) AS i;
+PREPARE TRANSACTION 'pxact1';
+BEGIN;
+INSERT INTO ins VALUES (2);
+PREPARE TRANSACTION 'pxact2';
+});
+
+# create an idle in xact session
+my ($sess1_in, $sess1_out, $sess1_err) = ('', '', '');
+my $sess1 = IPC::Run::start(
+	[
+		'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+		$node_alpha->connstr('postgres')
+	],
+	'<', \$sess1_in,
+	'>', \$sess1_out,
+	'2>', \$sess1_err);
+
+$sess1_in = q{
+BEGIN;
+CREATE TABLE public.test_aborted (i int);
+SELECT pg_backend_pid();
+};
+$sess1->pump until $sess1_out =~ qr/[[:digit:]]+[\r\n]$/m;
+my $sess1_pid = $sess1_out;
+chomp $sess1_pid;
+
+# create an idle session
+my ($sess2_in, $sess2_out, $sess2_err) = ('', '', '');
+my $sess2 = IPC::Run::start(
+	[
+		'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+		$node_alpha->connstr('postgres')
+	],
+	'<', \$sess2_in,
+	'>', \$sess2_out,
+	'2>', \$sess2_err);
+$sess2_in = q{
+SELECT pg_backend_pid();
+};
+$sess2->pump until $sess2_out =~ qr/\d+\s*$/m;
+my $sess2_pid = $sess2_out;
+chomp $sess2_pid;
+
+$sess2_in = q{
+SELECT pg_is_in_recovery();
+};
+$sess2->pump until $sess2_out =~ qr/(t|f)\s*$/m;
+
+# idle session is not in recovery
+is( $1, 'f', 'idle session is not in recovery' );
+
+# Fast demote alpha.
+# Secondaries beta and gamma should keep streaming from it as cascaded standbys.
+# Idle in xact session should be terminated, idle session should stay alive.
+$node_alpha->demote('fast');
+
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	't', 'node alpha demoted to standby' );
+
+is( $node_alpha->safe_psql(
+		'postgres',
+		'SELECT array_agg(application_name ORDER BY application_name ASC) FROM pg_stat_replication'),
+	'{beta,gamma}', 'standbys keep replicating with alpha after demote' );
+
+# the idle in xact session should not survive the demote
+is( $node_alpha->safe_psql(
+		'postgres',
+		qq{SELECT count(*)
+		   FROM pg_catalog.pg_stat_activity
+		   WHERE pid = $sess1_pid}),
+	'0', 'previous idle in transaction session should be terminated' );
+
+# table "test_aborted" has been rolled back
+is( $node_alpha->safe_psql(
+		'postgres',
+		q{SELECT count(*) FROM pg_catalog.pg_class
+		  WHERE relname='test_aborted'
+		    AND relnamespace = (SELECT oid FROM pg_namespace
+		                        WHERE nspname='public')}),
+	'0', 'the transaction has been aborted during fast demote' );
+
+# the idle session should survive the demote
+is( $node_alpha->safe_psql(
+		'postgres',
+		qq{SELECT count(*)
+		   FROM pg_catalog.pg_stat_activity
+		   WHERE pid = $sess2_pid}),
+	'1', "the idle session should survive the demote: $sess2_pid" );
+
+# the idle session should report in recovery
+$sess2_out = '';
+$sess2_in = q{
+SELECT pg_is_in_recovery();
+};
+$sess2->pump until $sess2_out =~ qr/(t|f)\s*$/m;
+
+# idle session is now in recovery
+is( $1, 't', 'the idle session reports in recovery' );
+
+# close both sessions
+$sess1_out = $sess2_out = $sess1_in = $sess2_in = '';
+$sess1->finish;
+$sess2->finish;
+
+# Promote alpha back in production.
+$node_alpha->promote;
+
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	'f', "node alpha promoted" );
+
+# Check all 2PC xact have been restored
+is( $node_alpha->safe_psql(
+		'postgres',
+		"SELECT string_agg(gid, ',' order by gid asc) FROM pg_prepared_xacts"),
+	'pxact1,pxact2', "prepared transactions 'pxact1' and 'pxact2' exist" );
+
+# Commit one 2PC and check it on alpha and beta
+$node_alpha->safe_psql( 'postgres', "commit prepared 'pxact1'");
+
+is( $node_alpha->safe_psql(
+		'postgres', "SELECT array_agg(i::text ORDER BY i ASC) FROM new"),
+	'{1,2,3,4,5}', "prepared transaction 'pxact1' committed" );
+
+$node_alpha->wait_for_catchup($node_beta);
+$node_alpha->wait_for_catchup($node_gamma);
+
+is( $node_beta->safe_psql(
+		'postgres', "SELECT array_agg(i::text ORDER BY i ASC) FROM new"),
+	'{1,2,3,4,5}', "prepared transaction 'pxact1' replicated to beta" );
+
+is( $node_gamma->safe_psql(
+		'postgres', "SELECT array_agg(i::text ORDER BY i ASC) FROM new"),
+	'{1,2,3,4,5}', "prepared transaction 'pxact1' replicated to gamma" );
+
+# create another idle in xact session
+$sess1_in = q{
+BEGIN;
+CREATE TABLE public.test_succeed (i int);
+SELECT pg_backend_pid();
+};
+$sess1->pump until $sess1_out =~ qr/\d+\s*$/m;
+$sess1_pid = $sess1_out;
+chomp $sess1_pid;
+
+# swap roles between alpha and beta
+
+# Demote alpha in smart mode.
+# Don't wait for demote to complete here so we can use sess1
+# to keep doing some more write activity before commit and demote.
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_demote(false, false)'),
+	't', "demote signal sent to node alpha" );
+
+# wait for the demote to begin and wait for active xact.
+my $fh;
+while (1) {
+	my $status;
+	open my $fh, '<', $node_alpha->data_dir . '/postmaster.pid';
+	$status = $_ while <$fh>;
+	close $fh;
+	chomp($status);
+	last if $status eq 'demoting';
+	sleep 1;
+}
+
+# make sure the demote waits for running xacts
+sleep 2;
+
+# test no new session possible during demote
+$sess2_in = q{
+SELECT 1;
+};
+$sess2->start;
+$sess2->finish;
+ok( $sess2_err =~ /FATAL:  the database system is demoting\s$/, 'session rejected during demote process');
+
+# add some write activity on demote-blocking session sess1
+$sess1_out = '';
+$sess1_in = q{
+INSERT INTO public.test_succeed VALUES (1) RETURNING i;
+COMMIT;
+};
+$sess1->pump until $sess1_out =~ qr/\d+\s*$/m;
+$sess1->finish;
+
+chomp($sess1_out);
+is($sess1_out, '1', 'session in active xact able to write after the smart demote signal');
+
+$node_alpha->poll_query_until('postgres', 'SELECT pg_is_in_recovery()', 't');
+
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	't', "node alpha demoted" );
+
+# fetch the last REDO location from alpha and check beta received everything
+my ($stdout, $stderr) = run_command([ 'pg_controldata', $node_alpha->data_dir ]);
+$stdout =~ m{REDO location:\s+([0-9A-F]+/[0-9A-F]+)$}mg;
+my $redo_loc = $1;
+
+is( $node_beta->safe_psql(
+		'postgres',
+		"SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), '$redo_loc') > 0 "),
+	't', "node beta received the demote checkpoint from alpha" );
+
+# promote beta and check it
+$node_beta->promote;
+is( $node_beta->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	'f', "node beta promoted" );
+
+# Setup alpha to replicate from beta
+$node_alpha->enable_streaming($node_beta);
+$node_alpha->reload;
+
+# check alpha is replicating from it
+$node_beta->wait_for_catchup($node_alpha);
+
+is( $node_beta->safe_psql(
+		'postgres', 'SELECT application_name FROM pg_stat_replication'),
+	$node_alpha->name, 'alpha is replicating from beta' );
+
+# check gamma is still replicating from alpha
+$node_alpha->wait_for_catchup($node_gamma, 'write', $node_alpha->lsn('receive'));
+
+is( $node_alpha->safe_psql(
+		'postgres', 'SELECT application_name FROM pg_stat_replication'),
+	$node_gamma->name, 'gamma is replicating from alpha' );
+
+# make sure the second 2PC is still available on beta
+is( $node_beta->safe_psql(
+		'postgres', 'SELECT gid FROM pg_prepared_xacts'),
+	'pxact2', "prepared transaction 'pxact2' exists" );
+
+# commit the second 2PC and check its result on alpha and beta nodes
+$node_beta->safe_psql( 'postgres', "commit prepared 'pxact2'");
+
+is( $node_beta->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', "prepared transaction 'pxact2' committed" );
+
+$node_beta->wait_for_catchup($node_alpha);
+is( $node_alpha->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', "prepared transaction 'pxact2' streamed to alpha" );
+
+# check the 2PC has been cascaded to gamma
+$node_alpha->wait_for_catchup($node_gamma, 'write', $node_alpha->lsn('receive'));
+is( $node_gamma->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', "prepared transaction 'pxact2' streamed to gamma" );
-- 
2.20.1

In reply to: Jehan-Guillaume de Rorthais (#29)
Re: [patch] demote

On Tue, 18 Aug 2020 17:41:31 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

Hi,

Please find attached v5 of the patch set, rebased on master after various
conflicts.

Regards,

On Wed, 5 Aug 2020 00:04:53 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

Demote now keeps backends with no active xid alive. Smart mode keeps all
backends: it waits for them to finish their xact and enter read-only. Fast
mode terminates backends with an active xid and keeps all the others.
Backends enter "read-only" using LocalXLogInsertAllowed=0 and flip it to -1
(check recovery state) once demoted.
During demote, no new session is allowed.
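
To make the rules above easier to follow, here is a small self-contained C
sketch of the per-backend decision and of the read-only flag. It is only a toy
model, not code taken from the patch: demote_action(), xlog_insert_allowed(),
the enum names and the recovery_in_progress variable are hypothetical names
made up for this illustration. Only the 0 and -1 values of
LocalXLogInsertAllowed come from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the server-wide recovery state. */
static bool recovery_in_progress = false;

/*
 * Toy copy of the backend-local flag named in the mail:
 *    0 = WAL inserts refused (read-only while the demote is running)
 *   -1 = decide by checking the recovery state
 * Any positive value is treated as "allowed" in this sketch.
 */
static int	local_xlog_insert_allowed = -1;

static bool
xlog_insert_allowed(void)
{
	if (local_xlog_insert_allowed >= 0)
		return local_xlog_insert_allowed != 0;
	return !recovery_in_progress;
}

typedef enum { DEMOTE_SMART, DEMOTE_FAST } DemoteMode;
typedef enum { BACKEND_KEEP, BACKEND_WAIT_FOR_XACT, BACKEND_TERMINATE } BackendAction;

/* What happens to one backend when a demote is requested. */
static BackendAction
demote_action(DemoteMode mode, bool has_active_xid)
{
	assert(mode == DEMOTE_SMART || mode == DEMOTE_FAST);

	if (!has_active_xid)
		return BACKEND_KEEP;	/* no active xid: survives in both modes */
	return (mode == DEMOTE_FAST) ? BACKEND_TERMINATE : BACKEND_WAIT_FOR_XACT;
}

/* A surviving backend goes read-only while the demote is in progress. */
static void
backend_enter_read_only(void)
{
	local_xlog_insert_allowed = 0;
}

/* The instance as a whole is now a standby (hypothetical helper). */
static void
instance_demoted(void)
{
	recovery_in_progress = true;
}

/* Once demoted, the backend falls back to checking the recovery state. */
static void
backend_demoted(void)
{
	local_xlog_insert_allowed = -1;
}
```

In this model a backend refuses WAL inserts both during the demote (flag at 0)
and after it (flag at -1 with recovery in progress), which is the intent
described above.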

As backends with no active xid survive, a new SQL admin function
"pg_demote(fast bool, wait bool, wait_seconds int)" has been added.

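As a rough illustration of how the three pg_demote() arguments are used, the
sketch below isolates two pieces of arithmetic from the attached patch: which
signal file the "fast" flag selects, and how "wait_seconds" bounds the polling
loop. The file names and WAITS_PER_SECOND are taken from the patch; the helper
function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as defined in the attached patch. */
#define DEMOTE_SIGNAL_FILE		"demote"
#define DEMOTE_FAST_SIGNAL_FILE	"demote_fast"
#define WAITS_PER_SECOND		10

/* Hypothetical helper: signal file created for the requested mode. */
static const char *
demote_signal_file(bool fast)
{
	return fast ? DEMOTE_FAST_SIGNAL_FILE : DEMOTE_SIGNAL_FILE;
}

/* Hypothetical helper: upper bound on the number of polls when waiting. */
static int
demote_max_polls(int wait_seconds)
{
	/* the patch rejects wait_seconds <= 0 with an ERROR */
	assert(wait_seconds > 0);
	return WAITS_PER_SECOND * wait_seconds;
}

/* Hypothetical helper: latch timeout used for each poll, in milliseconds. */
static long
demote_poll_interval_ms(void)
{
	return 1000L / WAITS_PER_SECOND;
}
```

With the defaults declared in system_views.sql (fast = true, wait = true,
wait_seconds = 60), this gives the "demote_fast" file and at most 600 polls of
100 ms each before the function gives up with a WARNING.
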
Just to keep the list informed, I found a race condition leading to backends
trying to write to XLog after they processed the demote signal. E.g.:

[postmaster] LOG: all backends in read only
[checkpointer] LOG: demoting
[backend] PANIC: cannot make new WAL entries during recovery
STATEMENT: UPDATE pgbench_accounts [...]

Because of this, the postmaster enters crash recovery while the demote is in
progress.

I have a couple of other subjects right now, but I plan to get back to it soon.

Regards,