WIP: Failover Slots

Started by Simon Riggs about 10 years ago. 45 messages.
#1 Simon Riggs
simon@2ndQuadrant.com
1 attachment

Failover Slots

If logical decoding is taking place on a master and we fail over to a new
master, we would like logical decoding to continue from the new master. To
facilitate this, we introduce the concept of “Failover Slots”: slots that
generate WAL, so that any downstream standby can continue to produce logical
decoding output from the named plugin after promotion.

In the current patch, any slot defined on a master will generate WAL,
leading to a pending-slot being present on all standby nodes. When a
standby is promoted, the slot becomes usable and will have the properties
as of the last fsync on the master.

Failover slots are not accessible until promotion of the standby. Logical
decoding from a standby looks like a significant step beyond this and will
not be part of this patch.

The internal design is fairly clear, using a new rmgr. No real problems have
emerged so far.

Patch is WIP, posted for comment, so you can see where I'm going.

I'm expecting to have a working version including timeline following for
9.6.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

failover_slots.v2.patch (application/octet-stream)
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..600b544 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -10,7 +10,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
 	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
-	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
+	   replorigindesc.o replslotdesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/replslotdesc.c b/src/backend/access/rmgrdesc/replslotdesc.c
new file mode 100644
index 0000000..a49ebf7
--- /dev/null
+++ b/src/backend/access/rmgrdesc/replslotdesc.c
@@ -0,0 +1,73 @@
+/*-------------------------------------------------------------------------
+ *
+ * replslotdesc.c
+ *	  rmgr descriptor routines for replication/slot.c
+ *
+ * Portions Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/replslotdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "replication/slot_xlog.h"
+
+void
+replslot_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			{
+				ReplicationSlotInWAL xlrec;
+
+				xlrec = (ReplicationSlotInWAL) rec;
+
+				appendStringInfo(buf, "update");
+
+				break;
+			}
+		case XLOG_REPLSLOT_CREATE:
+			{
+				ReplicationSlotInWAL xlrec;
+
+				xlrec = (ReplicationSlotInWAL) rec;
+
+				appendStringInfo(buf, "create");
+
+				break;
+			}
+		case XLOG_REPLSLOT_DROP:
+			{
+				xl_replslot_drop *xlrec;
+
+				xlrec = (xl_replslot_drop *) rec;
+
+				appendStringInfo(buf, "drop %s", NameStr(xlrec->name));
+
+				break;
+			}
+	}
+}
+
+const char *
+replslot_identify(uint8 info)
+{
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			return "UPDATE";
+		case XLOG_REPLSLOT_CREATE:
+			return "CREATE";
+		case XLOG_REPLSLOT_DROP:
+			return "DROP";
+		default:
+			return NULL;
+	}
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7c4d773..0bd5796 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -24,6 +24,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 6cbe65e..2c5e743 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,9 @@ dbase_redo(XLogReaderState *record)
 		/* Clean out the xlog relcache too */
 		XLogDropDatabase(xlrec->db_id);
 
+		/* Drop any logical failover slots for this database */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* And remove the physical files */
 		if (!rmtree(dst_path, true))
 			ereport(WARNING,
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 9f60687..01e1b8e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -135,6 +135,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
+		case RM_REPLSLOT_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 1ce9081..29c3f8a 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -85,16 +85,19 @@ CheckLogicalDecodingRequirements(void)
 				 errmsg("logical decoding requires a database connection")));
 
 	/* ----
-	 * TODO: We got to change that someday soon...
+	 * TODO:
 	 *
-	 * There's basically three things missing to allow this:
+	 * There are some things missing to allow this:
 	 * 1) We need to be able to correctly and quickly identify the timeline a
 	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
+	 * 2) To prevent removal of rows we still need, we would need to enhance
+	 *    hot_standby_feedback so it sends both xmin and catalog_xmin to the master.
+	 *    A standby slot can't write WAL, so we wouldn't be able to use it
+	 *    directly for failover, without some very complex state interactions
+	 *    via master.
+	 *
+	 * So this doesn't seem likely to change anytime soon.
+	 *
 	 * ----
 	 */
 	if (RecoveryInProgress())
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c39e957..44fcc02 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -26,6 +26,16 @@
  * While the server is running, the state data is also cached in memory for
  * efficiency.
  *
+ * Originally, replication slots were unique to a single node, which meant
+ * they couldn't easily be used across replication failover. Global slots
+ * could come in various designs, the simplest of which is "failover slots".
+ * Any slot created on a master node generates WAL records that maintain
+ * a copy of the slot on standby nodes. If a standby node is promoted, the
+ * failover slot allows access to be restarted just as if the original
+ * master node were being accessed, allowing for the timeline change.
+ * Global slots may cause problems with name collisions with incautious
+ * choices of naming convention, which requires some additional checking.
+ *
  * ReplicationSlotAllocationLock must be taken in exclusive mode to allocate
  * or free a slot. ReplicationSlotControlLock must be taken in shared mode
  * to iterate over the slots, and in exclusive mode to change the in_use flag
@@ -44,6 +54,7 @@
 #include "common/string.h"
 #include "miscadmin.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -97,12 +108,17 @@ ReplicationSlot *MyReplicationSlot = NULL;
 int			max_replication_slots = 0;	/* the maximum number of replication
 										 * slots */
 
-static void ReplicationSlotDropAcquired(void);
+static void ReplicationSlotDropAcquired(bool deactivate);
 
 /* internal persistency functions */
 static void RestoreSlotFromDisk(const char *name);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
-static void SaveSlotToPath(ReplicationSlot *slot, const char *path, int elevel);
+static void SaveSlotToPath(ReplicationSlot *slot, const char *path, int elevel,
+						   bool create, bool redo);
+
+/* internal redo functions */
+static void ReplicationSlotRedoUpdate(ReplicationSlotInWAL xlrec);
+static void ReplicationSlotRedoCreate(ReplicationSlotInWAL xlrec);
 
 /*
  * Report shared-memory space needed by ReplicationSlotShmemInit.
@@ -346,6 +362,11 @@ ReplicationSlotAcquire(const char *name)
 				(errcode(ERRCODE_OBJECT_IN_USE),
 			   errmsg("replication slot \"%s\" is already active for PID %d",
 					  name, active_pid)));
+	if (RecoveryInProgress() && slot->data.failover)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_IN_USE),
+			   errmsg("replication slot \"%s\" is reserved for use after failover",
+					  name)));
 
 	/* We made this slot active, so it's ours now. */
 	MyReplicationSlot = slot;
@@ -367,9 +388,9 @@ ReplicationSlotRelease(void)
 		/*
 		 * Delete the slot. There is no !PANIC case where this is allowed to
 		 * fail, all that may happen is an incomplete cleanup of the on-disk
-		 * data.
+		 * data. Ensure we deactivate the slot also.
 		 */
-		ReplicationSlotDropAcquired();
+		ReplicationSlotDropAcquired(true);
 	}
 	else
 	{
@@ -397,15 +418,21 @@ ReplicationSlotDrop(const char *name)
 
 	ReplicationSlotAcquire(name);
 
-	ReplicationSlotDropAcquired();
+	/*
+	 * Ensure we deactivate slot
+	 */
+	ReplicationSlotDropAcquired(true);
 }
 
 /*
  * Permanently drop the currently acquired replication slot which will be
  * released by the point this function returns.
+ *
+ * If deactivate is true, grab ReplicationSlotControlLock and change
+ * in_use flags for slot.
  */
 static void
-ReplicationSlotDropAcquired(void)
+ReplicationSlotDropAcquired(bool deactivate)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -423,6 +450,18 @@ ReplicationSlotDropAcquired(void)
 	 */
 	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
 
+	/* Record the drop in XLOG if we aren't replaying WAL */
+	if (XLogInsertAllowed())
+	{
+		xl_replslot_drop xlrec;
+
+		memcpy(&(xlrec.name), NameStr(slot->data.name), NAMEDATALEN);
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xlrec));
+		(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_DROP);
+	}
+
 	/* Generate pathnames. */
 	sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
 	sprintf(tmppath, "pg_replslot/%s.tmp", NameStr(slot->data.name));
@@ -451,7 +490,11 @@ ReplicationSlotDropAcquired(void)
 	}
 	else
 	{
-		bool		fail_softly = slot->data.persistency == RS_EPHEMERAL;
+		bool		fail_softly = false;
+
+		if (RecoveryInProgress() ||
+			slot->data.persistency == RS_EPHEMERAL)
+			fail_softly = true;
 
 		SpinLockAcquire(&slot->mutex);
 		slot->active_pid = 0;
@@ -470,10 +513,13 @@ ReplicationSlotDropAcquired(void)
 	 * and nobody can be attached to this slot and thus access it without
 	 * scanning the array.
 	 */
-	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
-	slot->active_pid = 0;
-	slot->in_use = false;
-	LWLockRelease(ReplicationSlotControlLock);
+	if (deactivate)
+	{
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+		slot->active_pid = 0;
+		slot->in_use = false;
+		LWLockRelease(ReplicationSlotControlLock);
+	}
 
 	/*
 	 * Slot is dead and doesn't prevent resource removal anymore, recompute
@@ -511,7 +557,7 @@ ReplicationSlotSave(void)
 	Assert(MyReplicationSlot != NULL);
 
 	sprintf(path, "pg_replslot/%s", NameStr(MyReplicationSlot->data.name));
-	SaveSlotToPath(MyReplicationSlot, path, ERROR);
+	SaveSlotToPath(MyReplicationSlot, path, ERROR, false, false);
 }
 
 /*
@@ -739,6 +785,45 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	Assert(MyReplicationSlot == NULL);
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		if (s->data.database == dboid)
+		{
+			/*
+			 * There should be no connections to this dbid
+			 * therefore all slots for this dbid should be
+			 * logical, inactive failover slots.
+			 */
+			Assert(s->active_pid == 0);
+			Assert(s->in_use == false);
+			Assert(SlotIsLogical(s));
+
+			/*
+			 * Acquire the replication slot
+			 */
+			MyReplicationSlot = s;
+
+			/*
+			 * No need to deactivate slot, especially since we
+			 * already hold ReplicationSlotControlLock.
+			 */
+			ReplicationSlotDropAcquired(false);
+		}
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	MyReplicationSlot = NULL;
+}
 
 /*
  * Check whether the server's configuration supports using replication
@@ -860,7 +945,7 @@ CheckPointReplicationSlots(void)
 
 		/* save the slot to disk, locking is handled in SaveSlotToPath() */
 		sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
-		SaveSlotToPath(s, path, LOG);
+		SaveSlotToPath(s, path, LOG, false, false);
 	}
 	LWLockRelease(ReplicationSlotAllocationLock);
 }
@@ -964,7 +1049,7 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 
 	/* Write the actual state file. */
 	slot->dirty = true;			/* signal that we really need to write */
-	SaveSlotToPath(slot, tmppath, ERROR);
+	SaveSlotToPath(slot, tmppath, ERROR, true, false);
 
 	/* Rename the directory into place. */
 	if (rename(tmppath, path) != 0)
@@ -990,7 +1075,8 @@ CreateSlotOnDisk(ReplicationSlot *slot)
  * Shared functionality between saving and creating a replication slot.
  */
 static void
-SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
+SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel,
+			   bool create, bool redo)
 {
 	char		tmppath[MAXPGPATH];
 	char		path[MAXPGPATH];
@@ -998,15 +1084,18 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	ReplicationSlotOnDisk cp;
 	bool		was_dirty;
 
-	/* first check whether there's something to write out */
-	SpinLockAcquire(&slot->mutex);
-	was_dirty = slot->dirty;
-	slot->just_dirtied = false;
-	SpinLockRelease(&slot->mutex);
+	if (!redo)
+	{
+		/* first check whether there's something to write out */
+		SpinLockAcquire(&slot->mutex);
+		was_dirty = slot->dirty;
+		slot->just_dirtied = false;
+		SpinLockRelease(&slot->mutex);
 
-	/* and don't do anything if there's nothing to write */
-	if (!was_dirty)
-		return;
+		/* and don't do anything if there's nothing to write */
+		if (!was_dirty)
+			return;
+	}
 
 	LWLockAcquire(slot->io_in_progress_lock, LW_EXCLUSIVE);
 
@@ -1039,6 +1128,21 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	SpinLockRelease(&slot->mutex);
 
+	/*
+	 * If needed, record this action in WAL
+	 */
+	if (!redo &&
+		slot->data.failover &&
+		!RecoveryInProgress())
+	{
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&cp.slotdata), sizeof(ReplicationSlotPersistentData));
+		if (create)
+			(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_CREATE);
+		else
+			(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_UPDATE);
+	}
+
 	COMP_CRC32C(cp.checksum,
 				(char *) (&cp) + SnapBuildOnDiskNotChecksummedSize,
 				SnapBuildOnDiskChecksummedSize);
@@ -1279,3 +1383,182 @@ RestoreSlotFromDisk(const char *name)
 				(errmsg("too many replication slots active before shutdown"),
 				 errhint("Increase max_replication_slots and try again.")));
 }
+
+static void
+ReplicationSlotRedoUpdate(ReplicationSlotInWAL xlrec)
+{
+	bool	found = false;
+	ReplicationSlot *slot;
+	int		i;
+
+	/*
+	 * Prevent any slot from being created/dropped while we're active. As we
+	 * explicitly do *not* want to block iterating over replication_slots or
+	 * acquiring a slot we cannot take the control lock - but that's OK,
+	 * because holding ReplicationSlotAllocationLock is strictly stronger, and
+	 * enough to guarantee that nobody can change the in_use bits on us.
+	 */
+	LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		if (slot->in_use ||
+			strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) != 0)
+			continue;
+
+		/* update the persistent data */
+		slot->data.xmin = xlrec->xmin;
+		slot->data.catalog_xmin = xlrec->catalog_xmin;
+		slot->data.restart_lsn = xlrec->restart_lsn;
+		slot->data.confirmed_flush = xlrec->confirmed_flush;
+
+		/* update in memory state */
+		slot->effective_xmin = xlrec->xmin;
+		slot->effective_catalog_xmin = xlrec->catalog_xmin;
+
+		found = true;
+		break;
+	}
+
+	if (found)
+	{
+		char		path[MAXPGPATH];
+
+		sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
+		SaveSlotToPath(slot, path, WARNING, false, true);
+	}
+
+	LWLockRelease(ReplicationSlotAllocationLock);
+
+	if (!found)
+	{
+		ereport(WARNING,
+				(errmsg("failover slot \"%s\" not found during replay",
+						NameStr(xlrec->name))));
+	}
+}
+
+static void
+ReplicationSlotRedoCreate(ReplicationSlotInWAL xlrec)
+{
+	ReplicationSlot *slot;
+	bool	found_available = false;
+	bool	found_duplicate = false;
+	int		use_slotid = 0;
+	int		i;
+
+	/*
+	 * Prevent any slot from being created/dropped while we're active. As we
+	 * explicitly do *not* want to block iterating over replication_slots or
+	 * acquiring a slot we cannot take the control lock - but that's OK,
+	 * because holding ReplicationSlotAllocationLock is strictly stronger, and
+	 * enough to guarantee that nobody can change the in_use bits on us.
+	 */
+	LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Find first available slot, but keep on scanning...
+		 */
+		if (!slot->in_use && !found_available)
+		{
+			use_slotid = i;
+			found_available = true;
+		}
+
+		/*
+		 * Check for any duplicates
+		 */
+		if (strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) == 0)
+		{
+			found_available = true;
+			found_duplicate = true;
+			use_slotid = i;
+			break;
+		}
+	}
+
+	if (found_duplicate)
+	{
+		LWLockRelease(ReplicationSlotAllocationLock);
+
+		/*
+		 * Do something nasty to the sinful duplicants, but
+		 * take with locking.
+		 */
+
+		LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
+	}
+
+	if (found_available)
+	{
+		char		path[MAXPGPATH];
+
+		slot = &ReplicationSlotCtl->replication_slots[use_slotid];
+
+		/* restore the entire set of persistent data */
+		memcpy(&slot->data, xlrec,
+			   sizeof(ReplicationSlotPersistentData));
+
+		/* initialize in memory state */
+		slot->effective_xmin = xlrec->xmin;
+		slot->effective_catalog_xmin = xlrec->catalog_xmin;
+
+		slot->candidate_catalog_xmin = InvalidTransactionId;
+		slot->candidate_xmin_lsn = InvalidXLogRecPtr;
+		slot->candidate_restart_lsn = InvalidXLogRecPtr;
+		slot->candidate_restart_valid = InvalidXLogRecPtr;
+
+		sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
+		SaveSlotToPath(slot, path, WARNING, true, true);
+	}
+
+	LWLockRelease(ReplicationSlotAllocationLock);
+
+	if (!found_available)
+		ereport(WARNING,
+				(errmsg("no free replication slot for failover slot \"%s\"",
+						NameStr(xlrec->name))));
+}
+
+void
+replslot_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		/*
+		 * Update the values for an existing failover slot.
+		 */
+		case XLOG_REPLSLOT_UPDATE:
+			ReplicationSlotRedoUpdate((ReplicationSlotInWAL) XLogRecGetData(record));
+			break;
+
+		/*
+		 * Create a new failover slot. If there is already an existing
+		 * failover slot of that name, kill any user, then drop it and
+		 * create this one in its place.
+		 */
+		case XLOG_REPLSLOT_CREATE:
+			ReplicationSlotRedoCreate((ReplicationSlotInWAL) XLogRecGetData(record));
+			break;
+
+		/*
+		 * Drop an existing failover slot.
+		 */
+		case XLOG_REPLSLOT_DROP:
+			{
+				xl_replslot_drop *xlrec =
+				(xl_replslot_drop *) XLogRecGetData(record);
+
+				ReplicationSlotDrop(NameStr(xlrec->name));
+
+				break;
+			}
+
+		default:
+			elog(PANIC, "replslot_redo: unknown op code %u", info);
+	}
+}
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index b3c8140..ee1b3e1 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -18,6 +18,7 @@
 
 #include "access/htup_details.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "replication/logical.h"
 #include "replication/logicalfuncs.h"
 #include "utils/builtins.h"
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 5b88a8d..07efbe7 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -26,6 +26,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "rmgrdesc.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c083216..d944747 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -45,3 +45,4 @@ PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_start
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
+PG_RMGR(RM_REPLSLOT_ID, "ReplicationSlot", replslot_redo, replslot_desc, replslot_identify, NULL, NULL)
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 20dd7a2..134ced2 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -4,6 +4,7 @@
  *
  * Copyright (c) 2012-2015, PostgreSQL Global Development Group
  *
+ * src/include/replication/slot.h
  *-------------------------------------------------------------------------
  */
 #ifndef SLOT_H
@@ -11,69 +12,12 @@
 
 #include "fmgr.h"
 #include "access/xlog.h"
-#include "access/xlogreader.h"
+#include "replication/slot_xlog.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
 
 /*
- * Behaviour of replication slots, upon release or crash.
- *
- * Slots marked as PERSISTENT are crashsafe and will not be dropped when
- * released. Slots marked as EPHEMERAL will be dropped when released or after
- * restarts.
- *
- * EPHEMERAL slots can be made PERSISTENT by calling ReplicationSlotPersist().
- */
-typedef enum ReplicationSlotPersistency
-{
-	RS_PERSISTENT,
-	RS_EPHEMERAL
-} ReplicationSlotPersistency;
-
-/*
- * On-Disk data of a replication slot, preserved across restarts.
- */
-typedef struct ReplicationSlotPersistentData
-{
-	/* The slot's identifier */
-	NameData	name;
-
-	/* database the slot is active on */
-	Oid			database;
-
-	/*
-	 * The slot's behaviour when being dropped (or restored after a crash).
-	 */
-	ReplicationSlotPersistency persistency;
-
-	/*
-	 * xmin horizon for data
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId xmin;
-
-	/*
-	 * xmin horizon for catalog tuples
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId catalog_xmin;
-
-	/* oldest LSN that might be required by this replication slot */
-	XLogRecPtr	restart_lsn;
-
-	/* oldest LSN that the client has acked receipt for */
-	XLogRecPtr	confirmed_flush;
-
-	/* plugin name */
-	NameData	plugin;
-} ReplicationSlotPersistentData;
-
-/*
  * Shared memory state of a single replication slot.
  */
 typedef struct ReplicationSlot
@@ -171,6 +115,7 @@ extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
 extern void ReplicationSlotsComputeRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/include/replication/slot_xlog.h b/src/include/replication/slot_xlog.h
new file mode 100644
index 0000000..aef0312
--- /dev/null
+++ b/src/include/replication/slot_xlog.h
@@ -0,0 +1,101 @@
+/*-------------------------------------------------------------------------
+ * slot_xlog.h
+ *	   Replication slot management.
+ *
+ * Copyright (c) 2012-2015, PostgreSQL Global Development Group
+ *
+ * src/include/replication/slot_xlog.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef SLOT_XLOG_H
+#define SLOT_XLOG_H
+
+#include "fmgr.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogreader.h"
+
+/*
+ * Behaviour of replication slots, upon release or crash.
+ *
+ * Slots marked as PERSISTENT are crashsafe and will not be dropped when
+ * released. Slots marked as EPHEMERAL will be dropped when released or after
+ * restarts.
+ *
+ * EPHEMERAL slots can be made PERSISTENT by calling ReplicationSlotPersist().
+ */
+typedef enum ReplicationSlotPersistency
+{
+	RS_PERSISTENT,
+	RS_EPHEMERAL
+} ReplicationSlotPersistency;
+
+/*
+ * On-Disk data of a replication slot, preserved across restarts.
+ */
+typedef struct ReplicationSlotPersistentData
+{
+	/* The slot's identifier */
+	NameData	name;
+
+	/* database the slot is active on */
+	Oid			database;
+
+	/*
+	 * The slot's behaviour when being dropped (or restored after a crash).
+	 */
+	ReplicationSlotPersistency persistency;
+
+	/*
+	 * Slots created on master become failover-slots and are maintained
+	 * on all standbys, but are only assignable after failover.
+	 */
+	bool		failover;
+
+	/*
+	 * xmin horizon for data
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId xmin;
+
+	/*
+	 * xmin horizon for catalog tuples
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId catalog_xmin;
+
+	/* oldest LSN that might be required by this replication slot */
+	XLogRecPtr	restart_lsn;
+	TimeLineID	restart_tli;
+
+	/* oldest LSN that the client has acked receipt for */
+	XLogRecPtr	confirmed_flush;
+
+	/* plugin name */
+	NameData	plugin;
+} ReplicationSlotPersistentData;
+
+typedef ReplicationSlotPersistentData *ReplicationSlotInWAL;
+
+/*
+ * WAL records for failover slots
+ */
+#define XLOG_REPLSLOT_UPDATE	0x00
+#define XLOG_REPLSLOT_DROP		0x01
+#define XLOG_REPLSLOT_CREATE	0x02
+
+typedef struct xl_replslot_drop
+{
+	NameData	name;
+} xl_replslot_drop;
+
+/* WAL logging */
+extern void replslot_redo(XLogReaderState *record);
+extern void replslot_desc(StringInfo buf, XLogReaderState *record);
+extern const char *replslot_identify(uint8 info);
+
+#endif   /* SLOT_XLOG_H */
#2 Craig Ringer
craig@2ndquadrant.com
In reply to: Simon Riggs (#1)
Re: WIP: Failover Slots

On 2 January 2016 at 08:50, Simon Riggs <simon@2ndquadrant.com> wrote:

Patch is WIP, posted for comment, so you can see where I'm going.

I've applied this on a branch of master and posted it, with some comment
editorialization, as
https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots . The tree
will be subject to rebasing.

At present the patch does not appear to work. No slots are visible in the
replica's pg_replication_slots before or after promotion and no slot
information is written to the xlog according to pg_xlogdump:

$ ~/pg/96/bin/pg_xlogdump -r ReplicationSlot 000000010000000000000001
000000010000000000000003
pg_xlogdump: FATAL: error in WAL record at 0/301DDE0: invalid record
length at 0/301DE10

so it's very much a WIP. I've read through it and think the idea makes
sense, it's just still missing some pieces...

So. Initial review comments.

This looks pretty incomplete:

+ if (found_duplicate)
+ {
+ LWLockRelease(ReplicationSlotAllocationLock);
+
+ /*
+ * Do something nasty to the sinful duplicants, but
+ * take with locking.
+ */
+
+ LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
+ }

... and I'm not sure I understand how the possibility of a duplicate slot
can arise in the first place, since you cannot create a slot on a read
replica. This seems unnecessary.

I'm not sure I understand why, in ReplicationSlotRedoCreate, it's
especially desirable to prevent blocking iteration over
pg_replication_slots or acquiring a slot. When redoing a slot create isn't
that exactly what we should do? This looks like it's been copied & pasted
verbatim from CheckPointReplicationSlots . There it makes sense, since the
slots may be in active use. During redo it seems reasonable to just
acquire ReplicationSlotControlLock.

I'm not a fan of the ReplicationSlotInWAL typedef
for ReplicationSlotPersistentData. Especially as it's only used in the redo
functions but *not* when xlog is written out. I'd like to just replace it.

Purely for convenient testing there's a shell script in the tree -
https://github.com/2ndQuadrant/postgres/blob/dev/failover-slots/failover-slot-test.sh
.
Assuming a patched 9.6 in $HOME/pg/96 it'll do a run-through of the patch.
I'll attempt to convert this to use the new test infrastructure, but needed
something ASAP for development. Posted in case it's useful to others.

Now it's time to look into why this doesn't seem to be generating any xlog
when by rights it seems like it should. Also into at what point exactly we
purge existing slots on start / promotion of a read-replica.

TL;DR: this doesn't work yet, working on it.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#3 Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#2)
Re: WIP: Failover Slots

On 20 January 2016 at 21:02, Craig Ringer <craig@2ndquadrant.com> wrote:

TL;DR: this doesn't work yet, working on it.

Nothing is logged on slot creation because ReplicationSlot->data.failover
is never true. Once that's fixed by (for now) making all slots failover
slots, there's a crash in XLogInsert because of the use of reserved bits in
the XLogInsert info argument. Fix pushed.

I also noticed that slot drops seem to be logged whether or not the
slot is a failover slot. Pushed a fix for that.

The WAL writing is now working. I've made improvements to the rmgr xlogdump
support to make it clearer what's written.

Slots are still not visible on the replica so there's work to do tracing
redo, promotion, slot handling after starting from a basebackup, etc. The
patch is still very much WIP.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#4 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#1)
Re: WIP: Failover Slots

On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Failover Slots
In the current patch, any slot defined on a master will generate WAL,
leading to a pending-slot being present on all standby nodes. When a standby
is promoted, the slot becomes usable and will have the properties as of the
last fsync on the master.

No objection to the concept, but I think the new behavior needs to be
optional. I am guessing that you are thinking along similar lines,
but it's not explicitly called out here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#4)
Re: WIP: Failover Slots

On 21 January 2016 at 16:31, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Failover Slots
In the current patch, any slot defined on a master will generate WAL,
leading to a pending-slot being present on all standby nodes. When a
standby is promoted, the slot becomes usable and will have the
properties as of the last fsync on the master.

No objection to the concept, but I think the new behavior needs to be
optional. I am guessing that you are thinking along similar lines,
but it's not explicitly called out here.

I was unsure myself, but making them optional seems reasonable.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6Craig Ringer
craig@2ndquadrant.com
In reply to: Robert Haas (#4)
Re: WIP: Failover Slots

On 22 January 2016 at 00:31, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Failover Slots
In the current patch, any slot defined on a master will generate WAL,
leading to a pending-slot being present on all standby nodes. When a
standby is promoted, the slot becomes usable and will have the
properties as of the last fsync on the master.

No objection to the concept, but I think the new behavior needs to be
optional. I am guessing that you are thinking along similar lines,
but it's not explicitly called out here.

Yeah, I think that's the idea too. For one thing we'll want to allow
non-failover slots to continue to be usable from a streaming replica, but
we must ERROR if anyone attempts to attach to and replay from a failover
slot via a replica since we can't write WAL there. Both kinds are needed.

There's a 'failover' bool member in the slot persistent data for that
reason. It's not (yet) exposed via the UI.

I presume we'll want to:

* add a new default-false argument is_failover_slot or similar to
pg_create_logical_replication_slot and pg_create_physical_replication_slot

* Add a new optional flag argument FAILOVER to CREATE_REPLICATION_SLOT in
both its LOGICAL and PHYSICAL forms.

... and will be adding that to this patch, barring syntax objections etc.
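
Concretely, that could look something like the following. All of this syntax is proposed, not implemented, and the argument name in particular is open to bikeshedding:

```
-- SQL level: new optional argument, defaulting to false
SELECT pg_create_physical_replication_slot('myslot',
            is_failover_slot := true);
SELECT pg_create_logical_replication_slot('myslot', 'test_decoding',
            is_failover_slot := true);

-- walsender protocol: new optional flag on both forms
CREATE_REPLICATION_SLOT "myslot" PHYSICAL FAILOVER
CREATE_REPLICATION_SLOT "myslot" LOGICAL "test_decoding" FAILOVER
```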

It's also going to be necessary to handle what happens when a new failover
slot (physical or logical) is created on the master where it conflicts with
the name of a non-failover physical slot that was created on the replica.
In this case I am inclined to terminate any walsender backend for the
replica's slot with a conflict with recovery, remove its slot and replace
it with a failover slot. The failover slot does not permit replay while in
recovery so if the booted-off client reconnects it'll get an ERROR saying
it can't replay from a failover slot. It should be pretty clear to the
admin what's happening between the conflict with recovery and the failover
slot error. There could still be an issue if the client persistently keeps
retrying and successfully reconnects after replica promotion but I don't
see that much can be done about that. The documentation will need to
address the need to try to avoid name conflicts between slots created on
replicas and failover slots on the master.

I'll be working on those docs, on the name conflict handling, and on the
syntax during my coming flight. Toddler permitting.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7Robert Haas
robertmhaas@gmail.com
In reply to: Craig Ringer (#6)
Re: WIP: Failover Slots

On Fri, Jan 22, 2016 at 2:46 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

It's also going to be necessary to handle what happens when a new failover
slot (physical or logical) is created on the master where it conflicts with
the name of a non-failover physical slot that was created on the replica. In
this case I am inclined to terminate any walsender backend for the replica's
slot with a conflict with recovery, remove its slot and replace it with a
failover slot. The failover slot does not permit replay while in recovery so
if the booted-off client reconnects it'll get an ERROR saying it can't
replay from a failover slot. It should be pretty clear to the admin what's
happening between the conflict with recovery and the failover slot error.
There could still be an issue if the client persistently keeps retrying and
successfully reconnects after replica promotion but I don't see that much
can be done about that. The documentation will need to address the need to
try to avoid name conflicts between slots created on replicas and failover
slots on the master.

That's not going to win any design-of-the-year awards, but maybe it's
acceptable. It occurred to me to wonder if it might be better to
propagate logical slots partially or entirely outside the WAL stream,
because with this design you will end up with the logical slots on
every replica, including cascaded replicas, and I'm not sure that's
what we want. Then again, I'm also not sure it isn't what we want.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#8Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#7)
Re: WIP: Failover Slots

On 2016-01-22 11:40:24 -0500, Robert Haas wrote:

It occurred to me to wonder if it might be better to
propagate logical slots partially or entirely outside the WAL stream,
because with this design you will end up with the logical slots on
every replica, including cascaded replicas, and I'm not sure that's
what we want. Then again, I'm also not sure it isn't what we want.

Not propagating them through the WAL also has the rather large advantage
of not barring the way to using such slots on standbys.

I think it's technically quite possible to maintain the required
resources on multiple nodes. The question is how would you configure on
which nodes the resources need to be maintained? I can't come up with a
satisfying scheme...


#9Craig Ringer
craig@2ndquadrant.com
In reply to: Robert Haas (#7)
Re: WIP: Failover Slots

On 23 January 2016 at 00:40, Robert Haas <robertmhaas@gmail.com> wrote:

It occurred to me to wonder if it might be better to
propagate logical slots partially or entirely outside the WAL stream,
because with this design you will end up with the logical slots on
every replica, including cascaded replicas, and I'm not sure that's
what we want. Then again, I'm also not sure it isn't what we want.

I think it's the most sensible default if there's only going to be one
choice to start with. It's consistent with what we do elsewhere with
replicas so there won't be any surprises. Failover slots are a fairly
simple feature that IMO just makes slots behave more like you might expect
them to do in the first place.

I'm pretty hesitant to start making cascaded replicas different to each
other just for slots. There are lots of other things where variation
between replicas would be lovely - the most obvious of which is omitting
some databases from some replicas. Right now we have a single data stream,
WAL, that goes to every replica. If we're going to change that I'd really
like to address it in a way that'll meet future needs like selective
physical replication too. I also think we'd want to deal with the problem
of identifying and labeling nodes to do a good job of selective replication
of slots.

I'd like to get failover slots in place for 9.6 since they're fairly
self-contained and meet an immediate need: allowing replication using slots
(physical or logical) to follow a failover event.

After that I want to add logical decoding support for slot
creation/drop/advance. So a logical decoding plugin can mirror logical slot
state on another node. It wouldn't be useful for physical slots, of course,
but it'd allow failover between logical replicas where we can do cool
things like replicate just a subset of DBs/tables/etc. (A mapping of
upstream to downstream LSNs would be required, but hopefully not too hard).
That's post-9.6 work and separate to failover slots, though dependent on
them for the ability to decode slot changes from WAL.

Down the track I'd very much like to be less dependent on forcing
everything though WAL with a single timeline. I agree with Andres that
being able to use failover slots on replicas would be good, and that it's
not possible when we use WAL to track the info. I just think it's a massive
increase in complexity and scope and I'd really like to be able to follow
failover the simple way first, then enhance it.

Nothing stops us changing failover slots in 9.7+ from using WAL to some
other mechanism that we design carefully at that time. We can extend the
walsender command set for physical rep at a major version bump with no
major issues, including adding to the streaming rep protocol. There's lots
to figure out though, including how to maintain slot changes in a strict
ordering with WAL, how to store and read the info, etc.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#8)
Re: WIP: Failover Slots

On 23 January 2016 at 00:51, Andres Freund <andres@anarazel.de> wrote:

Not propagating them through the WAL also has the rather large advantage
of not barring the way to using such slots on standbys.

Yeah. So you could have a read-replica that has a slot and it has child
nodes you can fail over to, but you don't have to have the slot on the
master.

I don't personally find that to be a particularly compelling thing that
says "we must have this" ... but maybe I'm not seeing the full
significance/advantages.

I think it's technically quite possible to maintain the required
resources on multiple nodes. The question is how would you configure on
which nodes the resources need to be maintained? I can't come up with a
satisfying scheme...

That's part of it. Also the mechanism by which we actually replicate them -
protocol additions for the walsender protocol, how to reliably send
something that doesn't have an LSN, etc. It might be fairly simple, I
haven't thought about it deeply, but I'd rather not go there until the
basics are in place.

BTW, I'm keeping a working tree at
https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots . Subject
to rebasing, history not clean. It has a test script in it that'll go away
before patch posting.

Current state needs work to ensure that on-disk and in-memory
representations are kept in sync, but is getting there.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#11Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#10)
1 attachment(s)
Re: WIP: Failover Slots

Hi all

Here's v2 of the failover slots patch. It replicates a logical slot to a
physical streaming replica downstream, keeping the slots in sync. After the
downstream is promoted a client can replay from the logical slot.

UI to allow creation of non-failover slots is pending.

There's more testing to do to cover all the corners: drop slots, drop and
re-create, name conflicts between downstream !failover slots and upstream
failover slots, etc.

There's also a known bug where WAL isn't correctly retained for a slot
where that slot was created before a basebackup which I'll fix in a
revision shortly.

I'm interested in ideas on how to better test this.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Implement-failover-slots.patchtext/x-patch; charset=US-ASCII; name=0001-Implement-failover-slots.patchDownload
From c2535eb27c6efc5dddd16a6aa7142fd0f59e85d3 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 20 Jan 2016 17:16:29 +0800
Subject: [PATCH 1/2] Implement failover slots

Originally replication slots were unique to a single node and weren't
recorded in WAL or replicated. A logical decoding client couldn't follow
a physical standby failover and promotion because the promoted replica
didn't have the original master's slots. The replica may not have
retained all required WAL and there was no way to create a new logical
slot and rewind it back to the point the logical client had replayed to
anyway.

Failover slots lift this limitation by replicating slots consistently to
physical standbys, keeping them up to date and using them in WAL
retention calculations. This allows a logical decoding client to follow
a physical failover and promotion without losing its place in the change
stream.

Simon Riggs and Craig Ringer

WIP. Open items:

* Testing
* Implement !failover slots and UI for marking slots as failover slots
* Fix WAL retention for slots created before a basebackup
---
 src/backend/access/rmgrdesc/Makefile       |   2 +-
 src/backend/access/rmgrdesc/replslotdesc.c |  63 +++++
 src/backend/access/transam/rmgr.c          |   1 +
 src/backend/commands/dbcommands.c          |   3 +
 src/backend/replication/basebackup.c       |  12 -
 src/backend/replication/logical/decode.c   |   1 +
 src/backend/replication/logical/logical.c  |  19 +-
 src/backend/replication/slot.c             | 433 ++++++++++++++++++++++++++++-
 src/backend/replication/slotfuncs.c        |   1 +
 src/bin/pg_xlogdump/replslotdesc.c         |   1 +
 src/bin/pg_xlogdump/rmgrdesc.c             |   1 +
 src/include/access/rmgrlist.h              |   1 +
 src/include/replication/slot.h             |  61 +---
 src/include/replication/slot_xlog.h        | 103 +++++++
 14 files changed, 610 insertions(+), 92 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/replslotdesc.c
 create mode 120000 src/bin/pg_xlogdump/replslotdesc.c
 create mode 100644 src/include/replication/slot_xlog.h

diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..600b544 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -10,7 +10,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
 	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
-	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
+	   replorigindesc.o replslotdesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/replslotdesc.c b/src/backend/access/rmgrdesc/replslotdesc.c
new file mode 100644
index 0000000..b882846
--- /dev/null
+++ b/src/backend/access/rmgrdesc/replslotdesc.c
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * replslotdesc.c
+ *	  rmgr descriptor routines for replication/slot.c
+ *
+ * Portions Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/replslotdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "replication/slot_xlog.h"
+
+void
+replslot_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			{
+				ReplicationSlotInWAL xlrec;
+
+				xlrec = (ReplicationSlotInWAL) rec;
+
+				appendStringInfo(buf, "slot %s to xmin=%u, catmin=%u, restart_lsn="UINT64_FORMAT"@%u",
+						NameStr(xlrec->name), xlrec->xmin, xlrec->catalog_xmin,
+						xlrec->restart_lsn, xlrec->restart_tli);
+
+				break;
+			}
+		case XLOG_REPLSLOT_DROP:
+			{
+				xl_replslot_drop *xlrec;
+
+				xlrec = (xl_replslot_drop *) rec;
+
+				appendStringInfo(buf, "slot %s", NameStr(xlrec->name));
+
+				break;
+			}
+	}
+}
+
+const char *
+replslot_identify(uint8 info)
+{
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			return "CREATE_OR_UPDATE";
+		case XLOG_REPLSLOT_DROP:
+			return "DROP";
+		default:
+			return NULL;
+	}
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7c4d773..0bd5796 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -24,6 +24,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c1c0223..61fc45b 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,9 @@ dbase_redo(XLogReaderState *record)
 		/* Clean out the xlog relcache too */
 		XLogDropDatabase(xlrec->db_id);
 
+		/* Drop any logical failover slots for this database */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* And remove the physical files */
 		if (!rmtree(dst_path, true))
 			ereport(WARNING,
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index af0fb09..ab1f271 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -973,18 +973,6 @@ sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces,
 		}
 
 		/*
-		 * Skip pg_replslot, not useful to copy. But include it as an empty
-		 * directory anyway, so we get permissions right.
-		 */
-		if (strcmp(de->d_name, "pg_replslot") == 0)
-		{
-			if (!sizeonly)
-				_tarWriteHeader(pathbuf + basepathlen + 1, NULL, &statbuf);
-			size += 512;		/* Size of the header just added */
-			continue;
-		}
-
-		/*
 		 * We can skip pg_xlog, the WAL segments need to be fetched from the
 		 * WAL archive anyway. But include it as an empty directory anyway, so
 		 * we get permissions right.
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 88c3a49..76fc5c7 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -135,6 +135,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
+		case RM_REPLSLOT_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2e6d3f9..37ffa82 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -85,16 +85,19 @@ CheckLogicalDecodingRequirements(void)
 				 errmsg("logical decoding requires a database connection")));
 
 	/* ----
-	 * TODO: We got to change that someday soon...
+	 * TODO: Allow logical decoding from a standby
 	 *
-	 * There's basically three things missing to allow this:
+	 * There's some things missing to allow this:
 	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
+	 *    LSN belongs to
+	 * 2) To prevent needed rows from being removed, we would need
+	 *    to enhance hot_standby_feedback so it sends both xmin and
+	 *    catalog_xmin to the master.  A standby slot can't write WAL, so we
+	 *    wouldn't be able to use it directly for failover, without some very
+	 *    complex state interactions via master.
+	 *
+	 * So this doesn't seem likely to change anytime soon.
+	 *
 	 * ----
 	 */
 	if (RecoveryInProgress())
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c12e412..ce278dd 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -26,6 +26,17 @@
  * While the server is running, the state data is also cached in memory for
  * efficiency.
  *
+ * Any slot created on a master node generates WAL records that maintain a copy
+ * of the slot on standby nodes. If a standby node is promoted the failover
+ * slot allows access to be restarted just as if the original master node
+ * was being accessed, allowing for the timeline change. The replica considers
+ * slot positions when removing WAL to make sure it can satisfy the needs of
+ * slots after promotion. For logical decoding slots the slot's internal state
+ * is kept up to date so it's ready for use after promotion.
+ *
+ * Since replication slots cannot be created on a standby there's no risk of
+ * name collision from slot creation on the master.
+ *
  * ReplicationSlotAllocationLock must be taken in exclusive mode to allocate
  * or free a slot. ReplicationSlotControlLock must be taken in shared mode
  * to iterate over the slots, and in exclusive mode to change the in_use flag
@@ -44,6 +55,7 @@
 #include "common/string.h"
 #include "miscadmin.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -104,6 +116,10 @@ static void RestoreSlotFromDisk(const char *name);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
 static void SaveSlotToPath(ReplicationSlot *slot, const char *path, int elevel);
 
+/* internal redo functions */
+static void ReplicationSlotRedoCreateOrUpdate(ReplicationSlotInWAL xlrec);
+static void ReplicationSlotRedoDrop(const char * slotname);
+
 /*
  * Report shared-memory space needed by ReplicationSlotShmemInit.
  */
@@ -265,11 +281,21 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	Assert(!slot->in_use);
 	Assert(slot->active_pid == 0);
 	slot->data.persistency = persistency;
+
+	elog(LOG, "persistency is %i", (int)slot->data.persistency);
+
 	slot->data.xmin = InvalidTransactionId;
 	slot->effective_xmin = InvalidTransactionId;
 	StrNCpy(NameStr(slot->data.name), name, NAMEDATALEN);
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.restart_lsn = InvalidXLogRecPtr;
+	/*
+	 * TODO: control over whether a slot is a failover slot.
+	 *
+	 * For now make them all failover if created on the master. Which
+	 * is all slots, since you can't make one on a replica.
+	 */
+	slot->data.failover = !RecoveryInProgress();
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -305,6 +331,10 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 
 /*
  * Find a previously created slot and mark it as used by this backend.
+ *
+ * Sets active_pid and assigns MyReplicationSlot iff successfully acquired.
+ *
+ * ERRORs on an attempt to acquire a failover slot when in recovery.
  */
 void
 ReplicationSlotAcquire(const char *name)
@@ -327,7 +357,11 @@ ReplicationSlotAcquire(const char *name)
 		{
 			SpinLockAcquire(&s->mutex);
 			active_pid = s->active_pid;
-			if (active_pid == 0)
+			/*
+			 * We can only claim a slot for our use if it's not claimed
+			 * by someone else AND it isn't a failover slot on a standby.
+			 */
+			if (active_pid == 0 && !(RecoveryInProgress() && slot->data.failover))
 				s->active_pid = MyProcPid;
 			SpinLockRelease(&s->mutex);
 			slot = s;
@@ -341,12 +375,24 @@ ReplicationSlotAcquire(const char *name)
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_OBJECT),
 				 errmsg("replication slot \"%s\" does not exist", name)));
+
 	if (active_pid != 0)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_IN_USE),
 			   errmsg("replication slot \"%s\" is already active for PID %d",
 					  name, active_pid)));
 
+	/*
+	 * An attempt to use a failover slot from a standby must fail since
+	 * we can't write WAL from a standby and there's no sensible way
+	 * to advance slot position from both replica and master anyway.
+	 */
+	if (RecoveryInProgress() && slot->data.failover)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_IN_USE),
+				 errmsg("replication slot \"%s\" is reserved for use after failover",
+					  name)));
+
 	/* We made this slot active, so it's ours now. */
 	MyReplicationSlot = slot;
 }
@@ -403,16 +449,23 @@ ReplicationSlotDrop(const char *name)
 /*
  * Permanently drop the currently acquired replication slot which will be
  * released by the point this function returns.
+ *
+ * Callers must NOT hold ReplicationSlotControlLock in SHARED mode.  EXCLUSIVE
+ * is OK, or not held at all.
  */
 static void
-ReplicationSlotDropAcquired(void)
+ReplicationSlotDropAcquired()
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
 	ReplicationSlot *slot = MyReplicationSlot;
+	bool slot_is_failover;
+	bool took_control_lock = false;
 
 	Assert(MyReplicationSlot != NULL);
 
+	slot_is_failover = slot->data.failover;
+
 	/* slot isn't acquired anymore */
 	MyReplicationSlot = NULL;
 
@@ -423,6 +476,18 @@ ReplicationSlotDropAcquired(void)
 	 */
 	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
 
+	/* Record the drop in XLOG if we aren't replaying WAL */
+	if (XLogInsertAllowed() && slot_is_failover)
+	{
+		xl_replslot_drop xlrec;
+
+		memcpy(&(xlrec.name), NameStr(slot->data.name), NAMEDATALEN);
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xlrec));
+		(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_DROP);
+	}
+
 	/* Generate pathnames. */
 	sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
 	sprintf(tmppath, "pg_replslot/%s.tmp", NameStr(slot->data.name));
@@ -451,7 +516,11 @@ ReplicationSlotDropAcquired(void)
 	}
 	else
 	{
-		bool		fail_softly = slot->data.persistency == RS_EPHEMERAL;
+		bool		fail_softly = false;
+
+		if (RecoveryInProgress() ||
+			slot->data.persistency == RS_EPHEMERAL)
+			fail_softly = true;
 
 		SpinLockAcquire(&slot->mutex);
 		slot->active_pid = 0;
@@ -469,11 +538,20 @@ ReplicationSlotDropAcquired(void)
 	 * grabbing the mutex because nobody else can be scanning the array here,
 	 * and nobody can be attached to this slot and thus access it without
 	 * scanning the array.
+	 *
+	 * You must hold the lock in EXCLUSIVE mode or not at all.
 	 */
-	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	if (!LWLockHeldByMe(ReplicationSlotControlLock))
+	{
+		took_control_lock = true;
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	}
+
 	slot->active_pid = 0;
 	slot->in_use = false;
-	LWLockRelease(ReplicationSlotControlLock);
+
+	if (took_control_lock)
+		LWLockRelease(ReplicationSlotControlLock);
 
 	/*
 	 * Slot is dead and doesn't prevent resource removal anymore, recompute
@@ -536,6 +614,9 @@ ReplicationSlotMarkDirty(void)
 /*
  * Convert a slot that's marked as RS_EPHEMERAL to a RS_PERSISTENT slot,
  * guaranteeing it will be there after an eventual crash.
+ *
+ * Failover slots will emit a create xlog record at this time, having
+ * not been previously written to xlog.
  */
 void
 ReplicationSlotPersist(void)
@@ -739,6 +820,45 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	Assert(MyReplicationSlot == NULL);
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		if (s->data.database == dboid)
+		{
+			/*
+			 * There should be no connections to this dbid
+			 * therefore all slots for this dbid should be
+			 * logical, inactive failover slots.
+			 */
+			Assert(s->active_pid == 0);
+			Assert(s->in_use == false);
+			Assert(SlotIsLogical(s));
+
+			/*
+			 * Acquire the replication slot
+			 */
+			MyReplicationSlot = s;
+
+			/*
+			 * No need to deactivate slot, especially since we
+			 * already hold ReplicationSlotControlLock.
+			 */
+			ReplicationSlotDropAcquired();
+		}
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	MyReplicationSlot = NULL;
+}
 
 /*
  * Check whether the server's configuration supports using replication
@@ -988,6 +1108,8 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 
 /*
  * Shared functionality between saving and creating a replication slot.
+ *
+ * For failover slots this is where we emit xlog.
  */
 static void
 SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
@@ -998,15 +1120,18 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	ReplicationSlotOnDisk cp;
 	bool		was_dirty;
 
-	/* first check whether there's something to write out */
-	SpinLockAcquire(&slot->mutex);
-	was_dirty = slot->dirty;
-	slot->just_dirtied = false;
-	SpinLockRelease(&slot->mutex);
+	if (!RecoveryInProgress())
+	{
+		/* first check whether there's something to write out */
+		SpinLockAcquire(&slot->mutex);
+		was_dirty = slot->dirty;
+		slot->just_dirtied = false;
+		SpinLockRelease(&slot->mutex);
 
-	/* and don't do anything if there's nothing to write */
-	if (!was_dirty)
-		return;
+		/* and don't do anything if there's nothing to write */
+		if (!was_dirty)
+			return;
+	}
 
 	LWLockAcquire(slot->io_in_progress_lock, LW_EXCLUSIVE);
 
@@ -1039,6 +1164,25 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	SpinLockRelease(&slot->mutex);
 
+	/*
+	 * If needed, record this action in WAL
+	 */
+	if (slot->data.failover &&
+		slot->data.persistency == RS_PERSISTENT &&
+		!RecoveryInProgress())
+	{
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&cp.slotdata), sizeof(ReplicationSlotPersistentData));
+		/*
+		 * Note that slot creation on the downstream is also an "update".
+		 *
+		 * Slots can start off ephemeral and be updated to persistent. We just
+		 * log the update and the downstream creates the new slot if it doesn't
+		 * exist yet.
+		 */
+		(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_UPDATE);
+	}
+
 	COMP_CRC32C(cp.checksum,
 				(char *) (&cp) + SnapBuildOnDiskNotChecksummedSize,
 				SnapBuildOnDiskChecksummedSize);
@@ -1279,3 +1423,266 @@ RestoreSlotFromDisk(const char *name)
 				(errmsg("too many replication slots active before shutdown"),
 				 errhint("Increase max_replication_slots and try again.")));
 }
+
+/*
+ * This usually just writes new persistent data to the slot state, but an
+ * update record might create a new slot on the downstream if we changed a
+ * previously ephemeral slot to persistent. We have to decide which
+ * by looking for the existing slot.
+ */
+static void
+ReplicationSlotRedoCreateOrUpdate(ReplicationSlotInWAL xlrec)
+{
+	ReplicationSlot *slot;
+	bool	found_available = false;
+	bool	found_duplicate = false;
+	int		use_slotid = 0;
+	int		i;
+
+	/*
+	 * Find the slot if it exists, or the first free entry
+	 * to write it to otherwise. Also handle the case where
+	 * the slot exists on the downstream as a non-failover
+	 * slot with a clashing name.
+	 *
+	 * We're in redo, but someone could still create an ephemeral
+	 * slot and race with us unless we take the allocation lock.
+	 */
+	LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Find first unused position in the slots array, but keep on
+		 * scanning...
+		 */
+		if (!slot->in_use && !found_available)
+		{
+			use_slotid = i;
+			found_available = true;
+		}
+
+		/*
+		 * Keep looking for an existing slot with the same name. It could be
+		 * our failover slot to update or a non-failover slot with a
+		 * conflicting name.
+		 */
+		if (strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) == 0)
+		{
+			use_slotid = i;
+			found_available = true;
+			found_duplicate = true;
+			break;
+		}
+	}
+
+	if (found_duplicate && !slot->data.failover)
+	{
+		/*
+		 * TODO.
+		 *
+		 * A name clash with the incoming failover slot may occur when
+		 * a non-failover slot was created locally on a replica or when
+		 * redo failed partway through on a failover slot.
+		 *
+		 * For conflicting local slots we handle this by aborting any
+		 * connection using the slot with a conflict-with-recovery error,
+		 * then removing the local non-failover slot. The replacement slot
+		 * won't allow replay until promotion, so if the old client
+		 * reconnects it won't be able to make a mess by advancing the failover slot.
+		 *
+		 * 'conflict with recovery' aborts are only done for regular backends
+		 * so we'll have to send a cancel.
+		 *
+		 * We could race against the client, which might be able to restart
+		 * and re-acquire the slot before we can. Unlikely, but not impossible.
+		 * So we have to cope with the slot still being in_use when we
+		 * look at it after killing its client.
+		 *
+		 * This is all a bit complicated, so in this WIP patch we just ERROR
+		 * here and let the user clean up the mess by dropping the conflicting slot.
+		 *
+		 * For ephemeral slots this is a no-brainer, but it's less pretty for
+		 * persistent downstream slots.
+		 */
+
+		LWLockRelease(ReplicationSlotAllocationLock);
+
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("a local non-failover slot named \"%s\" already exists",
+					 NameStr(xlrec->name)),
+				 errdetail("While replaying the creation of a failover slot from the "
+						   "master, an existing non-failover slot with the same name "
+						   "was found on the replica. Replay cannot continue until "
+						   "the conflicting slot is dropped on the replica.")));
+
+		/*
+		 * Not reached.  If we ever continue here instead of erroring out,
+		 * we must re-acquire the allocation lock:
+		 * LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
+		 */
+	}
+
+	/*
+	 * This is either an empty slot control position to make a new slot or it's
+	 * an existing entry for this failover slot that we need to update. Most of
+	 * the logic is the same.
+	 */
+	if (found_available)
+	{
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+
+		slot = &ReplicationSlotCtl->replication_slots[use_slotid];
+
+		/* restore the entire set of persistent data */
+		memcpy(&slot->data, xlrec,
+			   sizeof(ReplicationSlotPersistentData));
+
+		Assert(strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) == 0);
+		Assert(slot->data.failover && slot->data.persistency == RS_PERSISTENT);
+
+		/* Update the non-persistent in-memory state */
+		slot->effective_xmin = xlrec->xmin;
+		slot->effective_catalog_xmin = xlrec->catalog_xmin;
+
+		if (found_duplicate)
+		{
+			char		path[MAXPGPATH];
+
+			elog(DEBUG1, "Updating existing slot %s", NameStr(slot->data.name));
+
+			/* Write an existing slot to disk */
+			Assert(slot->in_use);
+			Assert(slot->active_pid == 0); /* can't be replaying from failover slot */
+
+			sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
+			slot->dirty = true;
+			SaveSlotToPath(slot, path, ERROR);
+		}
+		else
+		{
+			elog(DEBUG1, "Creating slot %s", NameStr(slot->data.name));
+
+			/* In-memory state that's only set on create, not update */
+			slot->active_pid = 0;
+			slot->in_use = true;
+			slot->candidate_catalog_xmin = InvalidTransactionId;
+			slot->candidate_xmin_lsn = InvalidXLogRecPtr;
+			slot->candidate_restart_lsn = InvalidXLogRecPtr;
+			slot->candidate_restart_valid = InvalidXLogRecPtr;
+
+			CreateSlotOnDisk(slot);
+		}
+
+		LWLockRelease(ReplicationSlotControlLock);
+	}
+
+	LWLockRelease(ReplicationSlotAllocationLock);
+
+	if (!found_available)
+	{
+		/*
+		 * Because the standby should have max_replication_slots greater than
+		 * or equal to the master's, this shouldn't happen, but just in case...
+		 */
+		ereport(ERROR,
+				(errmsg("max_replication_slots exceeded, cannot replay failover slot creation"),
+				 errhint("Increase max_replication_slots and try again.")));
+	}
+}
+
+/*
+ * Redo a slot drop of a failover slot. This might be a redo during crash
+ * recovery on the master or it may be replay on a standby.
+ */
+static void
+ReplicationSlotRedoDrop(const char *slotname)
+{
+	/*
+	 * Acquire the failover slot that's to be dropped.
+	 *
+	 * We can't ReplicationSlotAcquire here because we want to acquire
+	 * a replication slot during replay, which isn't usually allowed.
+	 * Also, because we might crash midway through a drop, we can't
+	 * assume we'll actually find the slot, so it's not an error for
+	 * the slot to be missing.
+	 */
+	int			i;
+
+	Assert(MyReplicationSlot == NULL);
+
+	ReplicationSlotValidateName(slotname, ERROR);
+
+	/*
+	 * Search for the named failover slot and mark it active if we
+	 * find it.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		if (s->in_use && strcmp(slotname, NameStr(s->data.name)) == 0)
+		{
+			if (s->data.persistency != RS_PERSISTENT)
+			{
+				/* shouldn't happen */
+				elog(WARNING, "BUG: found conflicting non-persistent slot during failover slot drop");
+				break;
+			}
+
+			if (!s->data.failover)
+			{
+				/* shouldn't happen */
+				elog(WARNING, "BUG: found non-failover slot during redo of slot drop");
+				break;
+			}
+
+			/* A failover slot can't be active during recovery */
+			Assert(s->active_pid == 0);
+
+			/* Claim the slot */
+			s->active_pid = MyProcPid;
+			MyReplicationSlot = s;
+
+			break;
+		}
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	if (MyReplicationSlot != NULL)
+		ReplicationSlotDropAcquired();
+}
+
+void
+replslot_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		/*
+		 * Update the values for an existing failover slot or, when a slot
+		 * is first logged as persistent, create it on the downstream.
+		 */
+		case XLOG_REPLSLOT_UPDATE:
+			ReplicationSlotRedoCreateOrUpdate((ReplicationSlotInWAL) XLogRecGetData(record));
+			break;
+
+		/*
+		 * Drop an existing failover slot.
+		 */
+		case XLOG_REPLSLOT_DROP:
+			{
+				xl_replslot_drop *xlrec =
+				(xl_replslot_drop *) XLogRecGetData(record);
+
+				ReplicationSlotRedoDrop(NameStr(xlrec->name));
+
+				break;
+			}
+
+		default:
+			elog(PANIC, "replslot_redo: unknown op code %u", info);
+	}
+}
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 9cc24ea..e90d079 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -18,6 +18,7 @@
 
 #include "access/htup_details.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "replication/logical.h"
 #include "replication/logicalfuncs.h"
 #include "utils/builtins.h"
diff --git a/src/bin/pg_xlogdump/replslotdesc.c b/src/bin/pg_xlogdump/replslotdesc.c
new file mode 120000
index 0000000..2e088d2
--- /dev/null
+++ b/src/bin/pg_xlogdump/replslotdesc.c
@@ -0,0 +1 @@
+../../../src/backend/access/rmgrdesc/replslotdesc.c
\ No newline at end of file
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index f9cd395..73ed7d4 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -26,6 +26,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "rmgrdesc.h"
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index fab912d..124b7e5 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -45,3 +45,4 @@ PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_start
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
+PG_RMGR(RM_REPLSLOT_ID, "ReplicationSlot", replslot_redo, replslot_desc, replslot_identify, NULL, NULL)
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 80ad02a..cb35181 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -4,6 +4,7 @@
  *
  * Copyright (c) 2012-2016, PostgreSQL Global Development Group
  *
+ * src/include/replication/slot.h
  *-------------------------------------------------------------------------
  */
 #ifndef SLOT_H
@@ -11,69 +12,12 @@
 
 #include "fmgr.h"
 #include "access/xlog.h"
-#include "access/xlogreader.h"
+#include "replication/slot_xlog.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
 
 /*
- * Behaviour of replication slots, upon release or crash.
- *
- * Slots marked as PERSISTENT are crashsafe and will not be dropped when
- * released. Slots marked as EPHEMERAL will be dropped when released or after
- * restarts.
- *
- * EPHEMERAL slots can be made PERSISTENT by calling ReplicationSlotPersist().
- */
-typedef enum ReplicationSlotPersistency
-{
-	RS_PERSISTENT,
-	RS_EPHEMERAL
-} ReplicationSlotPersistency;
-
-/*
- * On-Disk data of a replication slot, preserved across restarts.
- */
-typedef struct ReplicationSlotPersistentData
-{
-	/* The slot's identifier */
-	NameData	name;
-
-	/* database the slot is active on */
-	Oid			database;
-
-	/*
-	 * The slot's behaviour when being dropped (or restored after a crash).
-	 */
-	ReplicationSlotPersistency persistency;
-
-	/*
-	 * xmin horizon for data
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId xmin;
-
-	/*
-	 * xmin horizon for catalog tuples
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId catalog_xmin;
-
-	/* oldest LSN that might be required by this replication slot */
-	XLogRecPtr	restart_lsn;
-
-	/* oldest LSN that the client has acked receipt for */
-	XLogRecPtr	confirmed_flush;
-
-	/* plugin name */
-	NameData	plugin;
-} ReplicationSlotPersistentData;
-
-/*
  * Shared memory state of a single replication slot.
  */
 typedef struct ReplicationSlot
@@ -171,6 +115,7 @@ extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
 extern void ReplicationSlotsComputeRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/include/replication/slot_xlog.h b/src/include/replication/slot_xlog.h
new file mode 100644
index 0000000..7caf009
--- /dev/null
+++ b/src/include/replication/slot_xlog.h
@@ -0,0 +1,103 @@
+/*-------------------------------------------------------------------------
+ * slot_xlog.h
+ *	   Replication slot management.
+ *
+ * Copyright (c) 2012-2016, PostgreSQL Global Development Group
+ *
+ * src/include/replication/slot_xlog.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef SLOT_XLOG_H
+#define SLOT_XLOG_H
+
+#include "fmgr.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogreader.h"
+
+/*
+ * Behaviour of replication slots, upon release or crash.
+ *
+ * Slots marked as PERSISTENT are crashsafe and will not be dropped when
+ * released. Slots marked as EPHEMERAL will be dropped when released or after
+ * restarts.
+ *
+ * EPHEMERAL slots can be made PERSISTENT by calling ReplicationSlotPersist().
+ */
+typedef enum ReplicationSlotPersistency
+{
+	RS_PERSISTENT,
+	RS_EPHEMERAL
+} ReplicationSlotPersistency;
+
+/*
+ * On-Disk data of a replication slot, preserved across restarts.
+ */
+typedef struct ReplicationSlotPersistentData
+{
+	/* The slot's identifier */
+	NameData	name;
+
+	/* database the slot is active on */
+	Oid			database;
+
+	/*
+	 * The slot's behaviour when being dropped (or restored after a crash).
+	 */
+	ReplicationSlotPersistency persistency;
+
+	/*
+	 * Slots created on the master become failover slots and are maintained
+	 * on all standbys, but they only become usable after promotion.
+	 */
+	bool		failover;
+
+	/*
+	 * xmin horizon for data
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId xmin;
+
+	/*
+	 * xmin horizon for catalog tuples
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId catalog_xmin;
+
+	/* oldest LSN that might be required by this replication slot */
+	XLogRecPtr	restart_lsn;
+	TimeLineID	restart_tli;
+
+	/* oldest LSN that the client has acked receipt for */
+	XLogRecPtr	confirmed_flush;
+
+	/* plugin name */
+	NameData	plugin;
+} ReplicationSlotPersistentData;
+
+typedef ReplicationSlotPersistentData *ReplicationSlotInWAL;
+
+/*
+ * WAL records for failover slots
+ *
+ * Note that the low 4 bits are reserved by the system. The high 4 bits are for
+ * rmgr use.
+ */
+#define XLOG_REPLSLOT_UPDATE	0x10
+#define XLOG_REPLSLOT_DROP		0x20
+
+typedef struct xl_replslot_drop
+{
+	NameData	name;
+} xl_replslot_drop;
+
+/* WAL logging */
+extern void replslot_redo(XLogReaderState *record);
+extern void replslot_desc(StringInfo buf, XLogReaderState *record);
+extern const char *replslot_identify(uint8 info);
+
+#endif   /* SLOT_XLOG_H */
-- 
2.1.0

#12Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#11)
1 attachment(s)
Re: WIP: Failover Slots

Hi all

Here's v3 of failover slots.

It doesn't add the UI yet, but it's now functionally complete except for
timeline following for logical slots, and I have a plan for that.

Attachments:

failover-slots-v3.patchtext/x-patch; charset=US-ASCII; name=failover-slots-v3.patchDownload
From 533a9327b54ba744b0a1fb0048e8cfe7d3d45ea1 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 20 Jan 2016 17:16:29 +0800
Subject: [PATCH 1/2] Implement failover slots

Originally replication slots were unique to a single node and weren't
recorded in WAL or replicated. A logical decoding client couldn't follow
a physical standby failover and promotion because the promoted replica
didn't have the original master's slots. The replica might not have
retained all the required WAL, and in any case there was no way to
create a new logical slot and rewind it back to the point the logical
client had replayed to.

Failover slots lift this limitation by replicating slots consistently to
physical standbys, keeping them up to date and using them in WAL
retention calculations. This allows a logical decoding client to follow
a physical failover and promotion without losing its place in the change
stream.

Simon Riggs and Craig Ringer

WIP. Open items:

* Testing
* Implement !failover slots and UI for marking slots as failover slots
* Fix WAL retention for slots created before a basebackup
---
 src/backend/access/rmgrdesc/Makefile           |   2 +-
 src/backend/access/rmgrdesc/replslotdesc.c     |  63 ++++
 src/backend/access/transam/rmgr.c              |   1 +
 src/backend/access/transam/xlogutils.c         |   6 +-
 src/backend/commands/dbcommands.c              |   3 +
 src/backend/replication/basebackup.c           |  12 -
 src/backend/replication/logical/decode.c       |   1 +
 src/backend/replication/logical/logical.c      |  19 +-
 src/backend/replication/logical/logicalfuncs.c |   3 +
 src/backend/replication/slot.c                 | 439 ++++++++++++++++++++++++-
 src/backend/replication/slotfuncs.c            |   1 +
 src/bin/pg_xlogdump/replslotdesc.c             |   1 +
 src/bin/pg_xlogdump/rmgrdesc.c                 |   1 +
 src/include/access/rmgrlist.h                  |   1 +
 src/include/replication/slot.h                 |  61 +---
 src/include/replication/slot_xlog.h            | 103 ++++++
 16 files changed, 624 insertions(+), 93 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/replslotdesc.c
 create mode 120000 src/bin/pg_xlogdump/replslotdesc.c
 create mode 100644 src/include/replication/slot_xlog.h

diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..600b544 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -10,7 +10,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
 	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
-	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
+	   replorigindesc.o replslotdesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/replslotdesc.c b/src/backend/access/rmgrdesc/replslotdesc.c
new file mode 100644
index 0000000..b882846
--- /dev/null
+++ b/src/backend/access/rmgrdesc/replslotdesc.c
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * replslotdesc.c
+ *	  rmgr descriptor routines for replication/slot.c
+ *
+ * Portions Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/replslotdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "replication/slot_xlog.h"
+
+void
+replslot_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			{
+				ReplicationSlotInWAL xlrec;
+
+				xlrec = (ReplicationSlotInWAL) rec;
+
+				appendStringInfo(buf, "slot %s to xmin=%u, catmin=%u, restart_lsn="UINT64_FORMAT"@%u",
+						NameStr(xlrec->name), xlrec->xmin, xlrec->catalog_xmin,
+						xlrec->restart_lsn, xlrec->restart_tli);
+
+				break;
+			}
+		case XLOG_REPLSLOT_DROP:
+			{
+				xl_replslot_drop *xlrec;
+
+				xlrec = (xl_replslot_drop *) rec;
+
+				appendStringInfo(buf, "slot %s", NameStr(xlrec->name));
+
+				break;
+			}
+	}
+}
+
+const char *
+replslot_identify(uint8 info)
+{
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			return "CREATE_OR_UPDATE";
+		case XLOG_REPLSLOT_DROP:
+			return "DROP";
+		default:
+			return NULL;
+	}
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7c4d773..0bd5796 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -24,6 +24,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 444e218..180a7d9 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -770,7 +770,11 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		 */
 		if (!RecoveryInProgress())
 		{
-			*pageTLI = ThisTimeLineID;
+			if (*pageTLI == 0)
+			{
+				/* caller may have set timeline already */
+				*pageTLI = ThisTimeLineID;
+			}
 			flushptr = GetFlushRecPtr();
 		}
 		else
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c1c0223..61fc45b 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,9 @@ dbase_redo(XLogReaderState *record)
 		/* Clean out the xlog relcache too */
 		XLogDropDatabase(xlrec->db_id);
 
+		/* Drop any logical failover slots for this database */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* And remove the physical files */
 		if (!rmtree(dst_path, true))
 			ereport(WARNING,
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index af0fb09..ab1f271 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -973,18 +973,6 @@ sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces,
 		}
 
 		/*
-		 * Skip pg_replslot, not useful to copy. But include it as an empty
-		 * directory anyway, so we get permissions right.
-		 */
-		if (strcmp(de->d_name, "pg_replslot") == 0)
-		{
-			if (!sizeonly)
-				_tarWriteHeader(pathbuf + basepathlen + 1, NULL, &statbuf);
-			size += 512;		/* Size of the header just added */
-			continue;
-		}
-
-		/*
 		 * We can skip pg_xlog, the WAL segments need to be fetched from the
 		 * WAL archive anyway. But include it as an empty directory anyway, so
 		 * we get permissions right.
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 88c3a49..76fc5c7 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -135,6 +135,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
+		case RM_REPLSLOT_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2e6d3f9..37ffa82 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -85,16 +85,19 @@ CheckLogicalDecodingRequirements(void)
 				 errmsg("logical decoding requires a database connection")));
 
 	/* ----
-	 * TODO: We got to change that someday soon...
+	 * TODO: Allow logical decoding from a standby
 	 *
-	 * There's basically three things missing to allow this:
+	 * Some things are still missing to allow this:
 	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
+	 *    LSN belongs to
+	 * 2) To prevent needed rows from being removed we would need to
+	 *    enhance hot_standby_feedback so it sends both xmin and
+	 *    catalog_xmin to the master.  A standby slot can't write WAL, so we
+	 *    wouldn't be able to use it directly for failover without some very
+	 *    complex state interactions via the master.
+	 *
+	 * So this doesn't seem likely to change anytime soon.
+	 *
 	 * ----
 	 */
 	if (RecoveryInProgress())
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index f789fc1..9e5bde1 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -114,6 +114,9 @@ int
 logical_read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
 {
+	LogicalDecodingContext *lctx = (LogicalDecodingContext *) state->private_data;
+	*pageTLI = lctx->slot->data.restart_tli;
+
 	return read_local_xlog_page(state, targetPagePtr, reqLen,
 						 targetRecPtr, cur_page, pageTLI);
 }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c12e412..e19fd4b 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -26,6 +26,17 @@
  * While the server is running, the state data is also cached in memory for
  * efficiency.
  *
+ * Any slot created on a master node generates WAL records that maintain a copy
+ * of the slot on standby nodes. If a standby node is promoted the failover
+ * slot allows access to be restarted just as if the the original master node
+ * was being accessed, allowing for the timeline change. The replica considers
+ * slot positions when removing WAL to make sure it can satisfy the needs of
+ * slots after promotion. For logical decoding slots the slot's internal state
+ * is kept up to date so it's ready for use after promotion.
+ *
+ * Since replication slots cannot be created on a standby there's no risk of
+ * name collision from slot creation on the master.
+ *
  * ReplicationSlotAllocationLock must be taken in exclusive mode to allocate
  * or free a slot. ReplicationSlotControlLock must be taken in shared mode
  * to iterate over the slots, and in exclusive mode to change the in_use flag
@@ -44,6 +55,7 @@
 #include "common/string.h"
 #include "miscadmin.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -104,6 +116,10 @@ static void RestoreSlotFromDisk(const char *name);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
 static void SaveSlotToPath(ReplicationSlot *slot, const char *path, int elevel);
 
+/* internal redo functions */
+static void ReplicationSlotRedoCreateOrUpdate(ReplicationSlotInWAL xlrec);
+static void ReplicationSlotRedoDrop(const char *slotname);
+
 /*
  * Report shared-memory space needed by ReplicationSlotShmemInit.
  */
@@ -265,11 +281,23 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	Assert(!slot->in_use);
 	Assert(slot->active_pid == 0);
 	slot->data.persistency = persistency;
+
+	elog(DEBUG1, "slot persistency is %d", (int) slot->data.persistency);
+
 	slot->data.xmin = InvalidTransactionId;
 	slot->effective_xmin = InvalidTransactionId;
 	StrNCpy(NameStr(slot->data.name), name, NAMEDATALEN);
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.restart_lsn = InvalidXLogRecPtr;
+	slot->data.restart_tli = 0;
+
+	/*
+	 * TODO: control over whether a slot is a failover slot.
+	 *
+	 * For now, make every slot created on the master a failover slot;
+	 * that's all slots, since slots can't be created on a replica.
+	 */
+	slot->data.failover = !RecoveryInProgress();
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -305,6 +333,10 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 
 /*
  * Find a previously created slot and mark it as used by this backend.
+ *
+ * Sets active_pid and assigns MyReplicationSlot iff successfully acquired.
+ *
+ * ERRORs on an attempt to acquire a failover slot when in recovery.
  */
 void
 ReplicationSlotAcquire(const char *name)
@@ -327,7 +359,11 @@ ReplicationSlotAcquire(const char *name)
 		{
 			SpinLockAcquire(&s->mutex);
 			active_pid = s->active_pid;
-			if (active_pid == 0)
+			/*
+			 * We can only claim a slot for our use if it's not claimed
+			 * by someone else AND it isn't a failover slot on a standby.
+			 */
+			if (active_pid == 0 && !(RecoveryInProgress() && s->data.failover))
 				s->active_pid = MyProcPid;
 			SpinLockRelease(&s->mutex);
 			slot = s;
@@ -341,12 +377,24 @@ ReplicationSlotAcquire(const char *name)
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_OBJECT),
 				 errmsg("replication slot \"%s\" does not exist", name)));
+
 	if (active_pid != 0)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_IN_USE),
 			   errmsg("replication slot \"%s\" is already active for PID %d",
 					  name, active_pid)));
 
+	/*
+	 * An attempt to use a failover slot from a standby must fail since
+	 * we can't write WAL from a standby and there's no sensible way
+	 * to advance slot position from both replica and master anyway.
+	 */
+	if (RecoveryInProgress() && slot->data.failover)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_IN_USE),
+				 errmsg("replication slot \"%s\" is reserved for use after failover",
+					  name)));
+
 	/* We made this slot active, so it's ours now. */
 	MyReplicationSlot = slot;
 }
@@ -403,16 +451,23 @@ ReplicationSlotDrop(const char *name)
 /*
  * Permanently drop the currently acquired replication slot which will be
  * released by the point this function returns.
+ *
+ * The caller must not hold ReplicationSlotControlLock in SHARED mode; holding
+ * it in EXCLUSIVE mode, or not holding it at all, is fine.
  */
 static void
-ReplicationSlotDropAcquired(void)
+ReplicationSlotDropAcquired()
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
 	ReplicationSlot *slot = MyReplicationSlot;
+	bool slot_is_failover;
+	bool took_control_lock = false;
 
 	Assert(MyReplicationSlot != NULL);
 
+	slot_is_failover = slot->data.failover;
+
 	/* slot isn't acquired anymore */
 	MyReplicationSlot = NULL;
 
@@ -423,6 +478,18 @@ ReplicationSlotDropAcquired(void)
 	 */
 	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
 
+	/* Record the drop in XLOG if we aren't replaying WAL */
+	if (XLogInsertAllowed() && slot_is_failover)
+	{
+		xl_replslot_drop xlrec;
+
+		memcpy(&(xlrec.name), NameStr(slot->data.name), NAMEDATALEN);
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xlrec));
+		(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_DROP);
+	}
+
 	/* Generate pathnames. */
 	sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
 	sprintf(tmppath, "pg_replslot/%s.tmp", NameStr(slot->data.name));
@@ -451,7 +518,11 @@ ReplicationSlotDropAcquired(void)
 	}
 	else
 	{
-		bool		fail_softly = slot->data.persistency == RS_EPHEMERAL;
+		bool		fail_softly = false;
+
+		if (RecoveryInProgress() ||
+			slot->data.persistency == RS_EPHEMERAL)
+			fail_softly = true;
 
 		SpinLockAcquire(&slot->mutex);
 		slot->active_pid = 0;
@@ -469,11 +540,20 @@ ReplicationSlotDropAcquired(void)
 	 * grabbing the mutex because nobody else can be scanning the array here,
 	 * and nobody can be attached to this slot and thus access it without
 	 * scanning the array.
+	 *
+	 * You must hold the lock in EXCLUSIVE mode or not at all.
 	 */
-	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	if (!LWLockHeldByMe(ReplicationSlotControlLock))
+	{
+		took_control_lock = true;
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	}
+
 	slot->active_pid = 0;
 	slot->in_use = false;
-	LWLockRelease(ReplicationSlotControlLock);
+
+	if (took_control_lock)
+		LWLockRelease(ReplicationSlotControlLock);
 
 	/*
 	 * Slot is dead and doesn't prevent resource removal anymore, recompute
@@ -536,6 +616,9 @@ ReplicationSlotMarkDirty(void)
 /*
  * Convert a slot that's marked as RS_EPHEMERAL to a RS_PERSISTENT slot,
  * guaranteeing it will be there after an eventual crash.
+ *
+ * Failover slots emit their creation xlog record at this time, not
+ * having been written to xlog previously.
  */
 void
 ReplicationSlotPersist(void)
@@ -739,6 +822,45 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	Assert(MyReplicationSlot == NULL);
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		if (s->in_use && s->data.database == dboid)
+		{
+			/*
+			 * There can be no connections to this database,
+			 * so all slots for this database should be
+			 * logical, inactive failover slots.
+			 */
+			Assert(s->active_pid == 0);
+			Assert(SlotIsLogical(s));
+
+			/*
+			 * Acquire the replication slot
+			 */
+			MyReplicationSlot = s;
+
+			/*
+			 * No need to deactivate slot, especially since we
+			 * already hold ReplicationSlotControlLock.
+			 */
+			ReplicationSlotDropAcquired();
+		}
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	MyReplicationSlot = NULL;
+}
 
 /*
  * Check whether the server's configuration supports using replication
@@ -771,6 +893,7 @@ ReplicationSlotReserveWal(void)
 
 	Assert(slot != NULL);
 	Assert(slot->data.restart_lsn == InvalidXLogRecPtr);
+	Assert(slot->data.restart_tli == 0);
 
 	/*
 	 * The replication slot mechanism is used to prevent removal of required
@@ -800,6 +923,7 @@ ReplicationSlotReserveWal(void)
 
 			/* start at current insert position */
 			slot->data.restart_lsn = GetXLogInsertRecPtr();
+			slot->data.restart_tli = ThisTimeLineID;
 
 			/* make sure we have enough information to start */
 			flushptr = LogStandbySnapshot();
@@ -810,6 +934,12 @@ ReplicationSlotReserveWal(void)
 		else
 		{
 			slot->data.restart_lsn = GetRedoRecPtr();
+			/*
+			 * We don't actually use this yet. The walsender tracks timelines
+			 * for physical slots and non-slot based replay at a higher level.
+			 * It wouldn't get updated on TLI switch anyway.
+			 */
+			slot->data.restart_tli = 0;
 		}
 
 		/* prevent WAL removal as fast as possible */
@@ -988,6 +1118,8 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 
 /*
  * Shared functionality between saving and creating a replication slot.
+ *
+ * For failover slots this is where we emit xlog.
  */
 static void
 SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
@@ -998,15 +1130,18 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	ReplicationSlotOnDisk cp;
 	bool		was_dirty;
 
-	/* first check whether there's something to write out */
-	SpinLockAcquire(&slot->mutex);
-	was_dirty = slot->dirty;
-	slot->just_dirtied = false;
-	SpinLockRelease(&slot->mutex);
+	if (!RecoveryInProgress())
+	{
+		/* first check whether there's something to write out */
+		SpinLockAcquire(&slot->mutex);
+		was_dirty = slot->dirty;
+		slot->just_dirtied = false;
+		SpinLockRelease(&slot->mutex);
 
-	/* and don't do anything if there's nothing to write */
-	if (!was_dirty)
-		return;
+		/* and don't do anything if there's nothing to write */
+		if (!was_dirty)
+			return;
+	}
 
 	LWLockAcquire(slot->io_in_progress_lock, LW_EXCLUSIVE);
 
@@ -1039,6 +1174,25 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	SpinLockRelease(&slot->mutex);
 
+	/*
+	 * If needed, record this action in WAL
+	 */
+	if (slot->data.failover &&
+		slot->data.persistency == RS_PERSISTENT &&
+		!RecoveryInProgress())
+	{
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&cp.slotdata), sizeof(ReplicationSlotPersistentData));
+		/*
+		 * Note that slot creation on the downstream is also an "update".
+		 *
+		 * Slots can start off ephemeral and be updated to persistent. We just
+		 * log the update and the downstream creates the new slot if it doesn't
+		 * exist yet.
+		 */
+		(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_UPDATE);
+	}
+
 	COMP_CRC32C(cp.checksum,
 				(char *) (&cp) + SnapBuildOnDiskNotChecksummedSize,
 				SnapBuildOnDiskChecksummedSize);
@@ -1279,3 +1433,262 @@ RestoreSlotFromDisk(const char *name)
 				(errmsg("too many replication slots active before shutdown"),
 				 errhint("Increase max_replication_slots and try again.")));
 }
+
+/*
+ * This usually just writes new persistent data to the slot state, but an
+ * update record might create a new slot on the downstream if we changed a
+ * previously ephemeral slot to persistent. We have to decide which
+ * by looking for the existing slot.
+ */
+static void
+ReplicationSlotRedoCreateOrUpdate(ReplicationSlotInWAL xlrec)
+{
+	ReplicationSlot *slot;
+	bool	found_available = false;
+	bool	found_duplicate = false;
+	int		use_slotid = 0;
+	int		i;
+
+	/*
+	 * Find the slot if it exists, or the first free entry
+	 * to write it to otherwise. Also handle the case where
+	 * the slot exists on the downstream as a non-failover
+	 * slot with a clashing name.
+	 *
+	 * We're in redo, but someone could still create an ephemeral
+	 * slot and race with us unless we take the allocation lock.
+	 */
+	LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Find first unused position in the slots array, but keep on
+		 * scanning...
+		 */
+		if (!slot->in_use && !found_available)
+		{
+			use_slotid = i;
+			found_available = true;
+		}
+
+		/*
+		 * Keep looking for an existing slot with the same name. It could be
+		 * our failover slot to update or a non-failover slot with a
+		 * conflicting name.
+		 */
+		if (strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) == 0)
+		{
+			use_slotid = i;
+			found_available = true;
+			found_duplicate = true;
+			break;
+		}
+	}
+
+	if (found_duplicate && !slot->data.failover)
+	{
+		/*
+		 * TODO.
+		 *
+		 * A name clash with the incoming failover slot may occur when
+		 * a non-failover slot was created locally on a replica or when
+		 * redo failed partway through on a failover slot.
+		 *
+		 * For conflicting local slots, we handle this by aborting any
+		 * connection using the slot with a conflict-with-recovery error, then
+		 * removing the local non-failover slot. The replacement slot won't
+		 * allow replay until promotion so if the old client reconnects it
+		 * won't be able to make a mess by advancing the failover slot.
+		 *
+		 * 'conflict with recovery' aborts are only done for regular backends
+		 * so we'll have to send a cancel.
+		 *
+		 * We could race against the client, which might be able to restart
+		 * and re-acquire the slot before we can. Unlikely, but not impossible.
+		 * So we have to cope with the slot still being in_use when we
+		 * look at it after killing its client.
+		 *
+		 * This is all a bit complicated so in the WIP patch just ERROR here,
+		 * letting the user clean up the mess by dropping the conflicting slot.
+		 *
+		 * For ephemeral slots this is a no-brainer, but it's less pretty for
+		 * persistent downstream slots.
+		 */
+
+		LWLockRelease(ReplicationSlotAllocationLock);
+
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("a local non-failover slot named \"%s\" already exists",
+					 NameStr(xlrec->name)),
+				 errdetail("While replaying the creation of a failover slot from the "
+						   "master, an existing non-failover slot with the same name "
+						   "was found on the replica. Replay cannot continue until "
+						   "the conflicting slot is dropped on the replica.")));
+
+		/* Not reached after the ERROR above; if we ever continue here
+		 * instead, we must first re-acquire
+		 * ReplicationSlotAllocationLock in LW_SHARED mode. */
+	}
+
+	/*
+	 * This is either an empty slot control position to make a new slot or it's
+	 * an existing entry for this failover slot that we need to update. Most of
+	 * the logic is the same.
+	 */
+	if (found_available)
+	{
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+
+		slot = &ReplicationSlotCtl->replication_slots[use_slotid];
+
+		/* restore the entire set of persistent data */
+		memcpy(&slot->data, xlrec,
+			   sizeof(ReplicationSlotPersistentData));
+
+		Assert(strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) == 0);
+		Assert(slot->data.failover && slot->data.persistency == RS_PERSISTENT);
+
+		/* Update the non-persistent in-memory state */
+		slot->effective_xmin = xlrec->xmin;
+		slot->effective_catalog_xmin = xlrec->catalog_xmin;
+
+		if (found_duplicate)
+		{
+			char		path[MAXPGPATH];
+
+			/* Write an existing slot to disk */
+			Assert(slot->in_use);
+			Assert(slot->active_pid == 0); /* can't be replaying from failover slot */
+
+			sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
+			slot->dirty = true;
+			SaveSlotToPath(slot, path, ERROR);
+		}
+		else
+		{
+			/* In-memory state that's only set on create, not update */
+			slot->active_pid = 0;
+			slot->in_use = true;
+			slot->candidate_catalog_xmin = InvalidTransactionId;
+			slot->candidate_xmin_lsn = InvalidXLogRecPtr;
+			slot->candidate_restart_lsn = InvalidXLogRecPtr;
+			slot->candidate_restart_valid = InvalidXLogRecPtr;
+
+			CreateSlotOnDisk(slot);
+		}
+
+		LWLockRelease(ReplicationSlotControlLock);
+	}
+
+	LWLockRelease(ReplicationSlotAllocationLock);
+
+	if (!found_available)
+	{
+		/*
+		 * Because the standby should have the same or greater max_replication_slots
+		 * as the master, this shouldn't happen, but just in case...
+		 */
+		ereport(ERROR,
+				(errmsg("max_replication_slots exceeded, cannot replay failover slot creation"),
+				 errhint("Increase max_replication_slots and try again.")));
+	}
+}
+
+/*
+ * Redo a slot drop of a failover slot. This might be a redo during crash
+ * recovery on the master or it may be replay on a standby.
+ */
+static void
+ReplicationSlotRedoDrop(const char * slotname)
+{
+	/*
+	 * Acquire the failover slot that's to be dropped.
+	 *
+	 * We can't ReplicationSlotAcquire here because we want to acquire
+	 * a replication slot during replay, which isn't usually allowed.
+	 * Also, because we might crash midway through a drop, we can't
+	 * assume we'll actually find the slot, so it's not an error for
+	 * the slot to be missing.
+	 */
+	int			i;
+
+	Assert(MyReplicationSlot == NULL);
+
+	ReplicationSlotValidateName(slotname, ERROR);
+
+	/*
+	 * Search for the named failover slot and mark it active if we
+	 * find it.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		if (s->in_use && strcmp(slotname, NameStr(s->data.name)) == 0)
+		{
+			if (s->data.persistency != RS_PERSISTENT)
+			{
+				/* shouldn't happen */
+				elog(WARNING, "BUG: found conflicting non-persistent slot during failover slot drop");
+				break;
+			}
+
+			if (!s->data.failover)
+			{
+				/* shouldn't happen */
+				elog(WARNING, "BUG: found non-failover slot during redo of slot drop");
+				break;
+			}
+
+			/* A failover slot can't be active during recovery */
+			Assert(s->active_pid == 0);
+
+			/* Claim the slot */
+			s->active_pid = MyProcPid;
+			MyReplicationSlot = s;
+
+			break;
+		}
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	if (MyReplicationSlot != NULL)
+		ReplicationSlotDropAcquired();
+}
+
+void
+replslot_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		/*
+		 * Update the values for an existing failover slot or, when a slot
+		 * is first logged as persistent, create it on the downstream.
+		 */
+		case XLOG_REPLSLOT_UPDATE:
+			ReplicationSlotRedoCreateOrUpdate((ReplicationSlotInWAL) XLogRecGetData(record));
+			break;
+
+		/*
+		 * Drop an existing failover slot.
+		 */
+		case XLOG_REPLSLOT_DROP:
+			{
+				xl_replslot_drop *xlrec =
+				(xl_replslot_drop *) XLogRecGetData(record);
+
+				ReplicationSlotRedoDrop(NameStr(xlrec->name));
+
+				break;
+			}
+
+		default:
+			elog(PANIC, "replslot_redo: unknown op code %u", info);
+	}
+}
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 9cc24ea..e90d079 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -18,6 +18,7 @@
 
 #include "access/htup_details.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "replication/logical.h"
 #include "replication/logicalfuncs.h"
 #include "utils/builtins.h"
diff --git a/src/bin/pg_xlogdump/replslotdesc.c b/src/bin/pg_xlogdump/replslotdesc.c
new file mode 120000
index 0000000..2e088d2
--- /dev/null
+++ b/src/bin/pg_xlogdump/replslotdesc.c
@@ -0,0 +1 @@
+../../../src/backend/access/rmgrdesc/replslotdesc.c
\ No newline at end of file
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index f9cd395..73ed7d4 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -26,6 +26,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "rmgrdesc.h"
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index fab912d..124b7e5 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -45,3 +45,4 @@ PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_start
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
+PG_RMGR(RM_REPLSLOT_ID, "ReplicationSlot", replslot_redo, replslot_desc, replslot_identify, NULL, NULL)
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 80ad02a..cb35181 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -4,6 +4,7 @@
  *
  * Copyright (c) 2012-2016, PostgreSQL Global Development Group
  *
+ * src/include/replication/slot.h
  *-------------------------------------------------------------------------
  */
 #ifndef SLOT_H
@@ -11,69 +12,12 @@
 
 #include "fmgr.h"
 #include "access/xlog.h"
-#include "access/xlogreader.h"
+#include "replication/slot_xlog.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
 
 /*
- * Behaviour of replication slots, upon release or crash.
- *
- * Slots marked as PERSISTENT are crashsafe and will not be dropped when
- * released. Slots marked as EPHEMERAL will be dropped when released or after
- * restarts.
- *
- * EPHEMERAL slots can be made PERSISTENT by calling ReplicationSlotPersist().
- */
-typedef enum ReplicationSlotPersistency
-{
-	RS_PERSISTENT,
-	RS_EPHEMERAL
-} ReplicationSlotPersistency;
-
-/*
- * On-Disk data of a replication slot, preserved across restarts.
- */
-typedef struct ReplicationSlotPersistentData
-{
-	/* The slot's identifier */
-	NameData	name;
-
-	/* database the slot is active on */
-	Oid			database;
-
-	/*
-	 * The slot's behaviour when being dropped (or restored after a crash).
-	 */
-	ReplicationSlotPersistency persistency;
-
-	/*
-	 * xmin horizon for data
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId xmin;
-
-	/*
-	 * xmin horizon for catalog tuples
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId catalog_xmin;
-
-	/* oldest LSN that might be required by this replication slot */
-	XLogRecPtr	restart_lsn;
-
-	/* oldest LSN that the client has acked receipt for */
-	XLogRecPtr	confirmed_flush;
-
-	/* plugin name */
-	NameData	plugin;
-} ReplicationSlotPersistentData;
-
-/*
  * Shared memory state of a single replication slot.
  */
 typedef struct ReplicationSlot
@@ -171,6 +115,7 @@ extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
 extern void ReplicationSlotsComputeRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/include/replication/slot_xlog.h b/src/include/replication/slot_xlog.h
new file mode 100644
index 0000000..7caf009
--- /dev/null
+++ b/src/include/replication/slot_xlog.h
@@ -0,0 +1,103 @@
+/*-------------------------------------------------------------------------
+ * slot_xlog.h
+ *	   Replication slot management.
+ *
+ * Copyright (c) 2012-2016, PostgreSQL Global Development Group
+ *
+ * src/include/replication/slot_xlog.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef SLOT_XLOG_H
+#define SLOT_XLOG_H
+
+#include "fmgr.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogreader.h"
+
+/*
+ * Behaviour of replication slots, upon release or crash.
+ *
+ * Slots marked as PERSISTENT are crashsafe and will not be dropped when
+ * released. Slots marked as EPHEMERAL will be dropped when released or after
+ * restarts.
+ *
+ * EPHEMERAL slots can be made PERSISTENT by calling ReplicationSlotPersist().
+ */
+typedef enum ReplicationSlotPersistency
+{
+	RS_PERSISTENT,
+	RS_EPHEMERAL
+} ReplicationSlotPersistency;
+
+/*
+ * On-Disk data of a replication slot, preserved across restarts.
+ */
+typedef struct ReplicationSlotPersistentData
+{
+	/* The slot's identifier */
+	NameData	name;
+
+	/* database the slot is active on */
+	Oid			database;
+
+	/*
+	 * The slot's behaviour when being dropped (or restored after a crash).
+	 */
+	ReplicationSlotPersistency persistency;
+
+	/*
+	 * Slots created on master become failover-slots and are maintained
+	 * on all standbys, but are only assignable after failover.
+	 */
+	bool		failover;
+
+	/*
+	 * xmin horizon for data
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId xmin;
+
+	/*
+	 * xmin horizon for catalog tuples
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId catalog_xmin;
+
+	/* oldest LSN that might be required by this replication slot */
+	XLogRecPtr	restart_lsn;
+	TimeLineID	restart_tli;
+
+	/* oldest LSN that the client has acked receipt for */
+	XLogRecPtr	confirmed_flush;
+
+	/* plugin name */
+	NameData	plugin;
+} ReplicationSlotPersistentData;
+
+typedef ReplicationSlotPersistentData *ReplicationSlotInWAL;
+
+/*
+ * WAL records for failover slots
+ *
+ * Note that the low 4 bits are reserved by the system. The high 4 bits are for
+ * rmgr use.
+ */
+#define XLOG_REPLSLOT_UPDATE	0x10
+#define XLOG_REPLSLOT_DROP		0x20
+
+typedef struct xl_replslot_drop
+{
+	NameData	name;
+} xl_replslot_drop;
+
+/* WAL logging */
+extern void replslot_redo(XLogReaderState *record);
+extern void replslot_desc(StringInfo buf, XLogReaderState *record);
+extern const char *replslot_identify(uint8 info);
+
+#endif   /* SLOT_XLOG_H */
-- 
2.1.0

#13Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#8)
Re: WIP: Failover Slots

On Fri, Jan 22, 2016 at 11:51 AM, Andres Freund <andres@anarazel.de> wrote:

I think it's technically quite possible to maintain the required
resources on multiple nodes. The question is how would you configure on
which nodes the resources need to be maintained? I can't come up with a
satisfying scheme...

For this to work, I feel like the nodes need names, and a directory
that tells them how to reach each other.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#12)
4 attachment(s)
Re: WIP: Failover Slots

Hi all

I've attached the completed failover slots patch.

There were quite a few details to address so this is split into a three
patch series. I've also attached the test script I've been using.

We need failover slots to allow a logical decoding client to follow
physical failover, permitting HA "above" the decoding client.

For reviewers there's some additional explanation for why a few of the
changes are made the way they are on the wiki:
https://wiki.postgresql.org/wiki/Failover_slots . See "Patch notes".

The tagged tree for this submission is at
https://github.com/2ndQuadrant/postgres/tree/failover-slots-v4 .

I intend to backport this to 9.4 and 9.5 (though of course not for mainline
submission!).

Patch 1: add support for timeline following in logical decoding.

This is necessary to make failover slots useful. Otherwise decoding from a
slot will fail after failover because the server tries to read WAL with
ThisTimeLineID but the needed archives are on a historical timeline. While
the smallest part of the patch series, this was the most complex.

Patch 2: Failover slots core

* Add WAL logging and redo for failover slots

* copy pg_replslot/ in pg_basebackup

* drop non-failover slots on archive recovery startup

* expand the amount of WAL copied by pg_basebackup so failover slots are
usable after restore

* if a failover slot is created on the primary with the same name as an
existing non-failover slot on replica(s), kill any client connected to the
replica's slot and drop the replica's slot during redo

* Adds a new backup label entry MIN FAILOVER SLOT LSN to generated
backup label files if failover slots are present. This allows utilities
like pgbarman, omnipitr, etc. to know to retain more WAL to preserve
the function of failover slots.

* Return a lower LSN from pg_start_backup() and BASE_BACKUP
if needed to ensure that tools copy the extra WAL required by failover
slots during a base backup.
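To illustrate, a backup_label from a patched server with failover slots would then carry the extra entry roughly like this (the entry name is as above; the LSNs and file names are purely illustrative, and the exact position among the other entries may differ):

```
START WAL LOCATION: 0/9000028 (file 000000010000000000000009)
CHECKPOINT LOCATION: 0/9000060
MIN FAILOVER SLOT LSN: 0/8000120
```

A backup tool that sees MIN FAILOVER SLOT LSN can simply start its WAL retention from that (lower) LSN instead of the START WAL LOCATION.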

Relies on timeline following for logical decoding slots to be useful.

Does not add UI (function arguments, walsender syntax, changes to views,
etc) to expose failover slots to users. They can only be used by extensions
that call ReplicationSlotCreate directly.

Patch 3: User interface for failover slots

The 3rd patch adds the UI to expose failover slots to the user:

- A 'failover' boolean argument, default false, to
pg_create_physical_replication_slot(...) and
pg_create_logical_replication_slot(...)
- A new FAILOVER option to PG_CREATE_REPLICATION_SLOT on the walsender
protocol
- A new 'failover' boolean column in pg_catalog.pg_replication_slots
- SGML documentation changes for the new options and for failover slots in
general

Limited tests are also added in this patch since not much of this is really
testable by pg_regress. I've attached my local test script in case it's of
interest/use to anyone.
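For anyone wanting to try it, usage via the patch 3 UI would look roughly like the sketch below. The 'failover' argument name comes from the description above; its exact position and spelling are whatever the final patch settles on, so treat this as illustrative only:

```sql
-- Create a failover-capable logical slot on the master
-- (sketch; argument name/order per the patch 3 description above):
SELECT pg_create_logical_replication_slot('failover_slot', 'test_decoding',
                                          failover := true);

-- The new column in pg_replication_slots shows which slots will follow
-- a physical failover:
SELECT slot_name, slot_type, failover FROM pg_replication_slots;

-- After promoting a standby, decoding resumes from the same slot name:
SELECT * FROM pg_logical_slot_get_changes('failover_slot', NULL, NULL);
```

The last call is only expected to succeed after promotion, since failover slots are not assignable while the standby is still in recovery.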

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Allow-logical-slots-to-follow-timeline-switches.patchtext/x-patch; charset=US-ASCII; name=0001-Allow-logical-slots-to-follow-timeline-switches.patchDownload
From ff34b65ae8ad4b02ed58cb0575ce79e7498d4988 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 11 Feb 2016 10:44:14 +0800
Subject: [PATCH 1/4] Allow logical slots to follow timeline switches

Make logical replication slots timeline-aware, so replay can
continue from a historical timeline onto the server's current
timeline.

This is required to make failover slots possible and may also
be used by extensions that CreateReplicationSlot on a standby
and replay from that slot once the replica is promoted.

This does NOT add support for replaying from a logical slot on
a standby or for syncing slots to replicas.
---
 src/backend/access/transam/xlogreader.c        |  43 ++++-
 src/backend/access/transam/xlogutils.c         | 214 +++++++++++++++++++++++--
 src/backend/replication/logical/logicalfuncs.c |  38 ++++-
 src/include/access/xlogreader.h                |  33 +++-
 src/include/access/xlogutils.h                 |   2 +
 5 files changed, 295 insertions(+), 35 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fcb0872..5899f44 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -10,6 +10,9 @@
  *
  * NOTES
  *		See xlogreader.h for more notes on this facility.
+ *
+ * 		The xlogreader is compiled as both front-end and backend code so
+ * 		it may not use elog, server-defined static variables, etc.
  *-------------------------------------------------------------------------
  */
 
@@ -116,6 +119,9 @@ XLogReaderAllocate(XLogPageReadCB pagereadfunc, void *private_data)
 		return NULL;
 	}
 
+	/* Will be loaded on first read */
+	state->timelineHistory = NULL;
+
 	return state;
 }
 
@@ -135,6 +141,13 @@ XLogReaderFree(XLogReaderState *state)
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
+#ifdef FRONTEND
+	/* FE code doesn't use this and we can't list_free_deep on FE */
+	Assert(state->timelineHistory == NULL);
+#else
+	if (state->timelineHistory)
+		list_free_deep(state->timelineHistory);
+#endif
 	pfree(state->readBuf);
 	pfree(state);
 }
@@ -208,9 +221,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 
 	if (RecPtr == InvalidXLogRecPtr)
 	{
+		/* No explicit start point, read the record after the one we just read */
 		RecPtr = state->EndRecPtr;
 
 		if (state->ReadRecPtr == InvalidXLogRecPtr)
+			/* allow readPageTLI to go backward */
 			randAccess = true;
 
 		/*
@@ -223,6 +238,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	else
 	{
 		/*
+		 * Caller supplied a position to start at.
+		 *
 		 * In this case, the passed-in record pointer should already be
 		 * pointing to a valid record starting position.
 		 */
@@ -309,8 +326,9 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		/* XXX: more validation should be done here */
 		if (total_len < SizeOfXLogRecord)
 		{
-			report_invalid_record(state, "invalid record length at %X/%X",
-								  (uint32) (RecPtr >> 32), (uint32) RecPtr);
+			report_invalid_record(state, "invalid record length at %X/%X: wanted %lu, got %u",
+								  (uint32) (RecPtr >> 32), (uint32) RecPtr,
+								  SizeOfXLogRecord, total_len);
 			goto err;
 		}
 		gotheader = false;
@@ -466,9 +484,7 @@ err:
 	 * Invalidate the xlog page we've cached. We might read from a different
 	 * source after failure.
 	 */
-	state->readSegNo = 0;
-	state->readOff = 0;
-	state->readLen = 0;
+	XLogReaderInvalCache(state);
 
 	if (state->errormsg_buf[0] != '\0')
 		*errormsg = state->errormsg_buf;
@@ -599,9 +615,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 {
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X",
-							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
+		report_invalid_record(state, "invalid record length at %X/%X: wanted %lu, got %u",
+							  (uint32) (RecPtr >> 32), (uint32) RecPtr,
+							  SizeOfXLogRecord, record->xl_tot_len);
 		return false;
 	}
 	if (record->xl_rmid > RM_MAX_ID)
@@ -1337,3 +1353,14 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 
 	return true;
 }
+
+/*
+ * Invalidate the xlog reader's cached page to force a re-read
+ */
+void
+XLogReaderInvalCache(XLogReaderState *state)
+{
+	state->readSegNo = 0;
+	state->readOff = 0;
+	state->readLen = 0;
+}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 444e218..85bac01 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -7,6 +7,9 @@
  * This file contains support routines that are used by XLOG replay functions.
  * None of this code is used during normal system operation.
  *
+ * Unlike xlogreader.c, this is only compiled for the backend, so it may
+ * use elog, etc.
+ *
  *
  * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -21,6 +24,7 @@
 
 #include "miscadmin.h"
 
+#include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
@@ -651,6 +655,8 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 	static int	sendFile = -1;
 	static XLogSegNo sendSegNo = 0;
 	static uint32 sendOff = 0;
+	/* So we notice if asked for the same seg on a new tli: */
+	static TimeLineID lastTLI = 0;
 
 	p = buf;
 	recptr = startptr;
@@ -664,11 +670,11 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 
 		startoff = recptr % XLogSegSize;
 
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+		/* Do we need to switch to a new xlog segment? */
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) || lastTLI != tli)
 		{
 			char		path[MAXPGPATH];
 
-			/* Switch to another logfile segment */
 			if (sendFile >= 0)
 				close(sendFile);
 
@@ -692,6 +698,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 									path)));
 			}
 			sendOff = 0;
+			lastTLI = tli;
 		}
 
 		/* Need to seek in the file? */
@@ -759,28 +766,66 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 
 	loc = targetPagePtr + reqLen;
+
+	/* Make sure enough xlog is available... */
 	while (1)
 	{
 		/*
-		 * TODO: we're going to have to do something more intelligent about
-		 * timelines on standbys. Use readTimeLineHistory() and
-		 * tliOfPointInHistory() to get the proper LSN? For now we'll catch
-		 * that case earlier, but the code and TODO is left in here for when
-		 * that changes.
+		 * Check which timeline to get the record from.
+		 *
+		 * We have to recheck on each iteration because, if we're in
+		 * recovery as a cascading standby, the current timeline
+		 * might have become historical.
 		 */
-		if (!RecoveryInProgress())
+		XLogReadDetermineTimeline(state);
+
+		if (state->currTLI == ThisTimeLineID)
 		{
-			*pageTLI = ThisTimeLineID;
-			flushptr = GetFlushRecPtr();
+			/*
+			 * We're reading from the current timeline so we might
+			 * have to wait for the desired record to be generated
+			 * (or, for a standby, received & replayed)
+			 */
+			if (!RecoveryInProgress())
+			{
+				*pageTLI = ThisTimeLineID;
+				flushptr = GetFlushRecPtr();
+			}
+			else
+				flushptr = GetXLogReplayRecPtr(pageTLI);
+
+			if (loc <= flushptr)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(1000L);
 		}
 		else
-			flushptr = GetXLogReplayRecPtr(pageTLI);
-
-		if (loc <= flushptr)
+		{
+			/*
+			 * We're on a historical timeline, limit reading to the
+			 * switch point where we moved to the next timeline.
+			 *
+			 * We could just jump to the next timeline early since
+			 * the whole segment the last page is on got copied onto
+			 * the new timeline, but this is simpler.
+			 */
+			flushptr = state->currTLIValidUntil;
+
+			/*
+			 * FIXME: Setting pageTLI to the TLI the *record* we
+			 * want is on can be slightly wrong; the page might
+			 * begin on an older timeline if it contains a timeline
+			 * switch, since its xlog segment will've been copied
+			 * from the prior timeline. We should really read the
+			 * page header. It's pretty harmless though as nothing
+			 * cares so long as the timeline doesn't go backwards.
+			 */
+			*pageTLI = state->currTLI;
+
+			/* No need to wait on a historical timeline */
 			break;
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(1000L);
+		}
 	}
 
 	/* more than one block available */
@@ -793,7 +838,142 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	else
 		count = flushptr - targetPagePtr;
 
-	XLogRead(cur_page, *pageTLI, targetPagePtr, XLOG_BLCKSZ);
+	XLogRead(cur_page, *pageTLI, targetPagePtr, count);
 
 	return count;
 }
+
+/*
+ * Figure out what timeline to look on for the record the xlogreader
+ * is being asked to read, in currRecPtr. This may be used
+ * to determine which xlog segment file to open, etc.
+ *
+ * It depends on:
+ *
+ * - Whether we're reading a record immediately following one we read
+ *   before or doing a random read. We can only use the cached
+ *   timeline info if we're reading sequentially.
+ *
+ * - Whether the timeline of the prior record read was historical or
+ *   the current timeline and, if historical, on where it's valid up
+ *   to. On a historical timeline we need to avoid reading past the
+ *   timeline switch point. The records after it are probably invalid,
+ *   but worse, they might be valid but *different*.
+ *
+ * - If the current timeline became historical since the last record
+ *   we read. We need to make sure we don't read past the switch
+ *   point.
+ *
+ * None of this has any effect unless callbacks use currTLI to
+ * determine which timeline to read from and optionally use the
+ * validity limit to avoid reading past the valid end of a page.
+ *
+ * Note that an xlog segment may contain data from an older timeline
+ * if it was copied during a timeline switch. Callers may NOT assume
+ * that currTLI is the timeline that will be in a given page's
+ * xlp_tli; the page may begin on an older timeline.
+ */
+void
+XLogReadDetermineTimeline(XLogReaderState *state)
+{
+	if (state->timelineHistory == NULL)
+		state->timelineHistory = readTimeLineHistory(ThisTimeLineID);
+
+	if (state->currTLIValidUntil == InvalidXLogRecPtr &&
+		state->currTLI != ThisTimeLineID &&
+		state->currTLI != 0)
+	{
+		/*
+		 * We were reading what was the current timeline but it became
+		 * historical. Either we were replaying as a replica and got
+		 * promoted or we're replaying as a cascading replica from a
+		 * parent that got promoted.
+		 *
+		 * Force a re-read of the timeline history.
+		 */
+		list_free_deep(state->timelineHistory);
+		state->timelineHistory = readTimeLineHistory(ThisTimeLineID);
+
+		elog(DEBUG2, "timeline %u became historical during decoding",
+				state->currTLI);
+
+		/* then invalidate the timeline info so we read again */
+		state->currTLI = 0;
+	}
+
+	if (state->currRecPtr == state->EndRecPtr &&
+		state->currTLIValidUntil != InvalidXLogRecPtr &&
+		state->currRecPtr >= state->currTLIValidUntil)
+	{
+		/*
+		 * We're reading the immediately following record but we're at
+		 * a timeline boundary and must read the next record from the
+		 * new TLI.
+		 */
+		elog(DEBUG2, "Requested record %X/%X is after end of cur TLI %u "
+				"valid until %X/%X, switching to next timeline",
+				(uint32)(state->currRecPtr >> 32),
+				(uint32)state->currRecPtr,
+				state->currTLI,
+				(uint32)(state->currTLIValidUntil >> 32),
+				(uint32)(state->currTLIValidUntil));
+
+		/* Invalidate TLI info so we look it up again */
+		state->currTLI = 0;
+		state->currTLIValidUntil = InvalidXLogRecPtr;
+	}
+
+	if (state->currRecPtr != state->EndRecPtr ||
+		state->currTLI == 0)
+	{
+		/*
+		 * Something changed: we're not reading the record immediately
+		 * after the one we just read, the previous record ended at a
+		 * timeline boundary, or we haven't yet determined the timeline
+		 * to read from.
+		 *
+		 * Work out what timeline to read this record from.
+		 */
+		state->currTLI = tliOfPointInHistory(state->currRecPtr,
+				state->timelineHistory);
+
+		if (state->currTLI != ThisTimeLineID)
+		{
+			/*
+			 * It's on a historical timeline.
+			 *
+			 * We'll probably read more records after this so make a
+			 * note of the point at which we have to stop reading and do
+			 * another TLI switch.
+			 *
+			 * Callbacks can also use this to avoid reading past the
+			 * valid end of the TLI.
+			 */
+			state->currTLIValidUntil = tliSwitchPoint(state->currTLI,
+					state->timelineHistory, NULL);
+		}
+		else
+		{
+			/*
+			 * We're on the current timeline. The callback can use the
+			 * xlog flush position and we don't have to worry about
+			 * the TLI ending.
+			 *
+			 * If we're in recovery from another standby (cascading)
+			 * we could receive a new timeline, making the current
+			 * timeline historical. We check that by comparing currTLI
+			 * again at each record read.
+			 */
+			state->currTLIValidUntil = InvalidXLogRecPtr;
+		}
+
+		elog(DEBUG2, "XLog read ptr %X/%X is on tli %u valid until %X/%X, current tli is %u",
+				(uint32)(state->currRecPtr >> 32),
+				(uint32)state->currRecPtr,
+				state->currTLI,
+				(uint32)(state->currTLIValidUntil >> 32),
+				(uint32)(state->currTLIValidUntil),
+				ThisTimeLineID);
+	}
+}
+
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index f789fc1..f29fca3 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -231,12 +231,6 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	rsinfo->setResult = p->tupstore;
 	rsinfo->setDesc = p->tupdesc;
 
-	/* compute the current end-of-wal */
-	if (!RecoveryInProgress())
-		end_of_wal = GetFlushRecPtr();
-	else
-		end_of_wal = GetXLogReplayRecPtr(NULL);
-
 	ReplicationSlotAcquire(NameStr(*name));
 
 	PG_TRY();
@@ -263,6 +257,14 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 
 		ctx->output_writer_private = p;
 
+		/*
+		 * We start reading xlog from the restart lsn, even though in
+		 * CreateDecodingContext we set the snapshot builder up using the
+		 * slot's candidate_restart_lsn. This means we might read xlog we don't
+		 * actually decode rows from, but the snapshot builder might need it to
+		 * get to a consistent point. The point we start returning data to
+		 * *users* at is the candidate restart lsn from the decoding context.
+		 */
 		startptr = MyReplicationSlot->data.restart_lsn;
 
 		CurrentResourceOwner = ResourceOwnerCreate(CurrentResourceOwner, "logical decoding");
@@ -270,8 +272,14 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		/* invalidate non-timetravel entries */
 		InvalidateSystemCaches();
 
+		if (!RecoveryInProgress())
+			end_of_wal = GetFlushRecPtr();
+		else
+			end_of_wal = GetXLogReplayRecPtr(NULL);
+
+		/* Decode until we run out of records */
 		while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) ||
-			 (ctx->reader->EndRecPtr && ctx->reader->EndRecPtr < end_of_wal))
+			 (ctx->reader->EndRecPtr != InvalidXLogRecPtr && ctx->reader->EndRecPtr < end_of_wal))
 		{
 			XLogRecord *record;
 			char	   *errm = NULL;
@@ -280,6 +288,10 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 			if (errm)
 				elog(ERROR, "%s", errm);
 
+			/*
+			 * Now that we've set up the xlog reader state, subsequent calls
+			 * pass InvalidXLogRecPtr to mean "continue from the last record".
+			 */
 			startptr = InvalidXLogRecPtr;
 
 			/*
@@ -299,6 +311,18 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 			CHECK_FOR_INTERRUPTS();
 		}
 
+		/* Make sure timeline lookups use the start of the next record */
+		startptr = ctx->reader->EndRecPtr;
+
+		/*
+		 * The XLogReader will read a page past the valid end of WAL
+		 * because it doesn't know about timelines. When we switch
+		 * timelines and ask it for the first page on the new timeline it
+		 * will think it has it cached, but it'll have the old partial
+		 * page and say it can't find the next record. So flush the cache.
+		 */
+		XLogReaderInvalCache(ctx->reader);
+
 		tuplestore_donestoring(tupstore);
 
 		CurrentResourceOwner = old_resowner;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 7553cc4..4ccee95 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -20,12 +20,16 @@
  *		with the XLogRec* macros and functions. You can also decode a
  *		record that's already constructed in memory, without reading from
  *		disk, by calling the DecodeXLogRecord() function.
+ *
+ * 		The xlogreader is compiled as both frontend and backend code, so
+ * 		it may not use elog, server-defined static variables, etc.
  *-------------------------------------------------------------------------
  */
 #ifndef XLOGREADER_H
 #define XLOGREADER_H
 
 #include "access/xlogrecord.h"
+#include "nodes/pg_list.h"
 
 typedef struct XLogReaderState XLogReaderState;
 
@@ -139,26 +143,46 @@ struct XLogReaderState
 	 * ----------------------------------------
 	 */
 
-	/* Buffer for currently read page (XLOG_BLCKSZ bytes) */
+	/*
+	 * Buffer for currently read page (XLOG_BLCKSZ bytes, valid up to
+	 * at least readLen bytes)
+	 */
 	char	   *readBuf;
 
-	/* last read segment, segment offset, read length, TLI */
+	/*
+	 * last read segment, segment offset, read length, TLI for
+	 * data currently in readBuf.
+	 */
 	XLogSegNo	readSegNo;
 	uint32		readOff;
 	uint32		readLen;
 	TimeLineID	readPageTLI;
 
-	/* beginning of last page read, and its TLI  */
+	/*
+	 * beginning of the prior page read, and its TLI. Doesn't
+	 * necessarily correspond to what's in readBuf; used for
+	 * timeline sanity checks.
+	 */
 	XLogRecPtr	latestPagePtr;
 	TimeLineID	latestPageTLI;
 
 	/* beginning of the WAL record being read. */
 	XLogRecPtr	currRecPtr;
+	/* timeline to read it from, 0 if a lookup is required */
+	TimeLineID  currTLI;
+	/*
+	 * Endpoint of timeline in currTLI if it's historical or
+	 * InvalidXLogRecPtr if currTLI is the current timeline.
+	 */
+	XLogRecPtr	currTLIValidUntil;
 
 	/* Buffer for current ReadRecord result (expandable) */
 	char	   *readRecordBuf;
 	uint32		readRecordBufSize;
 
+	/* cached timeline history */
+	List	   *timelineHistory;
+
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
 };
@@ -174,6 +198,9 @@ extern void XLogReaderFree(XLogReaderState *state);
 extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
 			   XLogRecPtr recptr, char **errormsg);
 
+/* Flush any cached page */
+extern void XLogReaderInvalCache(XLogReaderState *state);
+
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif   /* FRONTEND */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 1b9abce..86df8cf 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -50,4 +50,6 @@ extern void FreeFakeRelcacheEntry(Relation fakerel);
 extern int read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI);
 
+extern void XLogReadDetermineTimeline(XLogReaderState *state);
+
 #endif
-- 
2.1.0

0002-Allow-replication-slots-to-follow-failover.patchtext/x-patch; charset=US-ASCII; name=0002-Allow-replication-slots-to-follow-failover.patchDownload
From 6ac7cd6614e18c7cf9a9a761eb359eab690f6e78 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 15 Feb 2016 11:56:13 +0800
Subject: [PATCH 2/4] Allow replication slots to follow failover

Originally replication slots were unique to a single node and weren't
recorded in WAL or replicated. A logical decoding client couldn't follow
a physical standby failover and promotion because the promoted replica
didn't have the original master's slots. The replica may not have
retained all required WAL and there was no way to create a new logical
slot and rewind it back to the point the logical client had replayed to.

Failover slots lift this limitation by replicating slots consistently to
physical standbys, keeping them up to date and using them in WAL
retention calculations. This allows a logical decoding client to follow
a physical failover and promotion without losing its place in the change
stream.

A failover slot may only be created on a master server, as it must be
able to write WAL. This limitation may be lifted later.

This patch adds a new backup label entry 'MIN FAILOVER SLOT LSN' that,
if present, indicates the minimum LSN needed by any failover slot that
is present in the base backup. Backup tools should check for this entry
and ensure they retain all xlogs including and after that point. It also
changes the return value of pg_start_backup(), the BASE_BACKUP walsender
command, etc. to report the minimum WAL required by any failover slot
if this is a lower LSN than the redo position, so that base backups
contain the WAL required for slots to work.

pg_basebackup is also modified to copy the contents of pg_replslot.
Non-failover slots will now be removed during backend startup instead
of being omitted from the copy.

This patch does not add any user interface for failover slots. There's
no way to create them from SQL or from the walsender. That and the
documentation for failover slots are in the next patch in the series
so that this patch is entirely focused on the implementation.

Craig Ringer, based on a prototype by Simon Riggs
---
 src/backend/access/rmgrdesc/Makefile      |   2 +-
 src/backend/access/transam/rmgr.c         |   1 +
 src/backend/access/transam/xlog.c         |  45 ++-
 src/backend/commands/dbcommands.c         |   3 +
 src/backend/replication/basebackup.c      |  12 -
 src/backend/replication/logical/decode.c  |   1 +
 src/backend/replication/logical/logical.c |  25 +-
 src/backend/replication/slot.c            | 579 ++++++++++++++++++++++++++++--
 src/backend/replication/slotfuncs.c       |   4 +-
 src/backend/replication/walsender.c       |   8 +-
 src/bin/pg_xlogdump/rmgrdesc.c            |   1 +
 src/include/access/rmgrlist.h             |   1 +
 src/include/replication/slot.h            |  69 +---
 13 files changed, 620 insertions(+), 131 deletions(-)

diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..600b544 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -10,7 +10,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
 	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
-	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
+	   replorigindesc.o replslotdesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7c4d773..0bd5796 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -24,6 +24,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8d480f7..d7bb30e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6318,8 +6318,11 @@ StartupXLOG(void)
 	/*
 	 * Initialize replication slots, before there's a chance to remove
 	 * required resources.
+	 *
+	 * If we're in archive recovery then non-failover slots are no
+	 * longer of any use and should be dropped during startup.
 	 */
-	StartupReplicationSlots();
+	StartupReplicationSlots(ArchiveRecoveryRequested);
 
 	/*
 	 * Startup logical state, needs to be setup now so we have proper data
@@ -9746,6 +9749,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	bool		backup_started_in_recovery = false;
 	XLogRecPtr	checkpointloc;
 	XLogRecPtr	startpoint;
+	XLogRecPtr  slot_startpoint;
 	TimeLineID	starttli;
 	pg_time_t	stamp_time;
 	char		strfbuf[128];
@@ -9892,6 +9896,16 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 			checkpointfpw = ControlFile->checkPointCopy.fullPageWrites;
 			LWLockRelease(ControlFileLock);
 
+			/*
+			 * If failover slots are in use we must retain and transfer WAL
+			 * older than the redo location so that decoding from those
+			 * slots can resume after a failover event.
+			 *
+			 * This MUST be at an xlog segment boundary so truncate the LSN
+			 * appropriately.
+			 */
+			slot_startpoint = (ReplicationSlotsComputeRequiredLSN(true) / XLOG_SEG_SIZE) * XLOG_SEG_SIZE;
+
 			if (backup_started_in_recovery)
 			{
 				XLogRecPtr	recptr;
@@ -10060,6 +10074,10 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 						 backup_started_in_recovery ? "standby" : "master");
 		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
 		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
+		if (slot_startpoint != InvalidXLogRecPtr)
+			appendStringInfo(&labelfbuf, "MIN FAILOVER SLOT LSN: %X/%X\n",
+						(uint32) (slot_startpoint >> 32), (uint32) slot_startpoint);
+
 
 		/*
 		 * Okay, write the file, or return its contents to caller.
@@ -10153,10 +10171,33 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 
 	/*
 	 * We're done.  As a convenience, return the starting WAL location.
+	 *
+	 * pg_basebackup etc expect to use this as the position to start copying
+	 * WAL from, so we should return the minimum of the slot start LSN and the
+	 * current redo position to make sure we get all WAL required by failover
+	 * slots.
+	 *
+	 * The min required LSN for failover slots is also available from the
+	 * 'MIN FAILOVER SLOT LSN' entry in the backup label file.
 	 */
+	if (slot_startpoint < startpoint)
+	{
+		List *history;
+		TimeLineID slot_start_tli;
+
+		/* Min LSN required by a slot may be on an older timeline. */
+		history = readTimeLineHistory(ThisTimeLineID);
+		slot_start_tli = tliOfPointInHistory(slot_startpoint, history);
+		list_free_deep(history);
+
+		if (slot_start_tli < starttli)
+			starttli = slot_start_tli;
+	}
+
 	if (starttli_p)
 		*starttli_p = starttli;
-	return startpoint;
+
+	return slot_startpoint < startpoint ? slot_startpoint : startpoint;
 }
 
 /* Error cleanup callback for pg_start_backup */
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c1c0223..61fc45b 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,9 @@ dbase_redo(XLogReaderState *record)
 		/* Clean out the xlog relcache too */
 		XLogDropDatabase(xlrec->db_id);
 
+		/* Drop any logical failover slots for this database */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* And remove the physical files */
 		if (!rmtree(dst_path, true))
 			ereport(WARNING,
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index af0fb09..ab1f271 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -973,18 +973,6 @@ sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces,
 		}
 
 		/*
-		 * Skip pg_replslot, not useful to copy. But include it as an empty
-		 * directory anyway, so we get permissions right.
-		 */
-		if (strcmp(de->d_name, "pg_replslot") == 0)
-		{
-			if (!sizeonly)
-				_tarWriteHeader(pathbuf + basepathlen + 1, NULL, &statbuf);
-			size += 512;		/* Size of the header just added */
-			continue;
-		}
-
-		/*
 		 * We can skip pg_xlog, the WAL segments need to be fetched from the
 		 * WAL archive anyway. But include it as an empty directory anyway, so
 		 * we get permissions right.
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 88c3a49..76fc5c7 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -135,6 +135,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
+		case RM_REPLSLOT_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2e6d3f9..4feb2ca 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -85,16 +85,19 @@ CheckLogicalDecodingRequirements(void)
 				 errmsg("logical decoding requires a database connection")));
 
 	/* ----
-	 * TODO: We got to change that someday soon...
+	 * TODO: Allow logical decoding from a standby
 	 *
-	 * There's basically three things missing to allow this:
+	 * There are some things missing to allow this:
 	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
+	 *    LSN belongs to
+	 * 2) To prevent needed rows from being removed, we would need
+	 *    to enhance hot_standby_feedback so it sends both xmin and
+	 *    catalog_xmin to the master.  A standby slot can't write WAL, so we
+	 *    wouldn't be able to use it directly for failover without some very
+	 *    complex state interactions via the master.
+	 *
+	 * So this doesn't seem likely to change anytime soon.
+	 *
 	 * ----
 	 */
 	if (RecoveryInProgress())
@@ -272,7 +275,7 @@ CreateInitDecodingContext(char *plugin,
 	slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
 	slot->data.catalog_xmin = slot->effective_catalog_xmin;
 
-	ReplicationSlotsComputeRequiredXmin(true);
+	ReplicationSlotsUpdateRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
 
@@ -908,8 +911,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 			MyReplicationSlot->effective_catalog_xmin = MyReplicationSlot->data.catalog_xmin;
 			SpinLockRelease(&MyReplicationSlot->mutex);
 
-			ReplicationSlotsComputeRequiredXmin(false);
-			ReplicationSlotsComputeRequiredLSN();
+			ReplicationSlotsUpdateRequiredXmin(false);
+			ReplicationSlotsUpdateRequiredLSN();
 		}
 	}
 	else
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index a2c6524..3b970c7 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -24,7 +24,18 @@
  * directory. Inside that directory the state file will contain the slot's
  * own data. Additional data can be stored alongside that file if required.
  * While the server is running, the state data is also cached in memory for
- * efficiency.
+ * efficiency. Non-failover slots are NOT subject to WAL logging and may
+ * be used on standbys (though that's only supported for physical slots at
+ * the moment). They use tempfile writes and swaps for crash safety.
+ *
+ * A failover slot created on a master node generates WAL records that
+ * maintain a copy of the slot on standby nodes. If a standby node is
+ * promoted, the failover slot allows access to be restarted just as if
+ * the original master node was being accessed, allowing for the timeline
+ * change. The replica considers slot positions when removing WAL to make
+ * sure it can satisfy the needs of slots after promotion.  For logical
+ * decoding slots the slot's internal state is kept up to date so it's
+ * ready for use after promotion.
  *
  * ReplicationSlotAllocationLock must be taken in exclusive mode to allocate
  * or free a slot. ReplicationSlotControlLock must be taken in shared mode
@@ -44,6 +55,7 @@
 #include "common/string.h"
 #include "miscadmin.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -101,10 +113,14 @@ static LWLockTranche ReplSlotIOLWLockTranche;
 static void ReplicationSlotDropAcquired(void);
 
 /* internal persistency functions */
-static void RestoreSlotFromDisk(const char *name);
+static void RestoreSlotFromDisk(const char *name, bool drop_nonfailover_slots);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
 static void SaveSlotToPath(ReplicationSlot *slot, const char *path, int elevel);
 
+/* internal redo functions */
+static void ReplicationSlotRedoCreateOrUpdate(ReplicationSlotInWAL xlrec);
+static void ReplicationSlotRedoDrop(const char * slotname);
+
 /*
  * Report shared-memory space needed by ReplicationSlotShmemInit.
  */
@@ -220,7 +236,8 @@ ReplicationSlotValidateName(const char *name, int elevel)
  */
 void
 ReplicationSlotCreate(const char *name, bool db_specific,
-					  ReplicationSlotPersistency persistency)
+					  ReplicationSlotPersistency persistency,
+					  bool failover)
 {
 	ReplicationSlot *slot = NULL;
 	int			i;
@@ -273,11 +290,23 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	Assert(!slot->in_use);
 	Assert(slot->active_pid == 0);
 	slot->data.persistency = persistency;
+
+	elog(DEBUG1, "slot persistency is %d", (int) slot->data.persistency);
+
 	slot->data.xmin = InvalidTransactionId;
 	slot->effective_xmin = InvalidTransactionId;
 	StrNCpy(NameStr(slot->data.name), name, NAMEDATALEN);
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.restart_lsn = InvalidXLogRecPtr;
+	/* Slot timeline is unused and always zero */
+	slot->data.restart_tli = 0;
+
+	if (failover && RecoveryInProgress())
+		ereport(ERROR,
+				(errmsg("a failover slot may not be created on a replica"),
+				 errhint("Create the slot on the master server instead.")));
+
+	slot->data.failover = failover;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -313,6 +342,10 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 
 /*
  * Find a previously created slot and mark it as used by this backend.
+ *
+ * Sets active_pid and assigns MyReplicationSlot iff successfully acquired.
+ *
+ * ERRORs on an attempt to acquire a failover slot when in recovery.
  */
 void
 ReplicationSlotAcquire(const char *name)
@@ -335,7 +368,11 @@ ReplicationSlotAcquire(const char *name)
 		{
 			SpinLockAcquire(&s->mutex);
 			active_pid = s->active_pid;
-			if (active_pid == 0)
+			/*
+			 * We can only claim a slot for our use if it's not claimed
+			 * by someone else AND it isn't a failover slot on a standby.
+			 */
+			if (active_pid == 0 && !(RecoveryInProgress() && s->data.failover))
 				s->active_pid = MyProcPid;
 			SpinLockRelease(&s->mutex);
 			slot = s;
@@ -349,12 +386,24 @@ ReplicationSlotAcquire(const char *name)
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_OBJECT),
 				 errmsg("replication slot \"%s\" does not exist", name)));
+
 	if (active_pid != 0)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_IN_USE),
 			   errmsg("replication slot \"%s\" is already active for PID %d",
 					  name, active_pid)));
 
+	/*
+	 * An attempt to use a failover slot from a standby must fail since
+	 * we can't write WAL from a standby and there's no sensible way
+	 * to advance slot position from both replica and master anyway.
+	 */
+	if (RecoveryInProgress() && slot->data.failover)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_IN_USE),
+				 errmsg("replication slot \"%s\" is reserved for use after failover",
+					  name)));
+
 	/* We made this slot active, so it's ours now. */
 	MyReplicationSlot = slot;
 }
@@ -411,16 +460,24 @@ ReplicationSlotDrop(const char *name)
 /*
  * Permanently drop the currently acquired replication slot which will be
  * released by the point this function returns.
+ *
+ * Callers must NOT hold ReplicationSlotControlLock in SHARED mode.  EXCLUSIVE
+ * is OK, or not held at all.
  */
 static void
-ReplicationSlotDropAcquired(void)
+ReplicationSlotDropAcquired(void)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
 	ReplicationSlot *slot = MyReplicationSlot;
+	bool slot_is_failover;
+	bool took_control_lock = false,
+		 took_allocation_lock = false;
 
 	Assert(MyReplicationSlot != NULL);
 
+	slot_is_failover = slot->data.failover;
+
 	/* slot isn't acquired anymore */
 	MyReplicationSlot = NULL;
 
@@ -428,8 +485,27 @@ ReplicationSlotDropAcquired(void)
 	 * If some other backend ran this code concurrently with us, we might try
 	 * to delete a slot with a certain name while someone else was trying to
 	 * create a slot with the same name.
+	 *
+	 * If called with the lock already held it MUST be held in
+	 * EXCLUSIVE mode.
 	 */
-	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+	if (!LWLockHeldByMe(ReplicationSlotAllocationLock))
+	{
+		took_allocation_lock = true;
+		LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+	}
+
+	/* Record the drop in XLOG if we aren't replaying WAL */
+	if (XLogInsertAllowed() && slot_is_failover)
+	{
+		xl_replslot_drop xlrec;
+
+		memcpy(&(xlrec.name), NameStr(slot->data.name), NAMEDATALEN);
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xlrec));
+		(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_DROP);
+	}
 
 	/* Generate pathnames. */
 	sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
@@ -459,7 +535,11 @@ ReplicationSlotDropAcquired(void)
 	}
 	else
 	{
-		bool		fail_softly = slot->data.persistency == RS_EPHEMERAL;
+		bool		fail_softly = false;
+
+		if (RecoveryInProgress() ||
+			slot->data.persistency == RS_EPHEMERAL)
+			fail_softly = true;
 
 		SpinLockAcquire(&slot->mutex);
 		slot->active_pid = 0;
@@ -477,18 +557,27 @@ ReplicationSlotDropAcquired(void)
 	 * grabbing the mutex because nobody else can be scanning the array here,
 	 * and nobody can be attached to this slot and thus access it without
 	 * scanning the array.
+	 *
+	 * You must hold the lock in EXCLUSIVE mode or not at all.
 	 */
-	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	if (!LWLockHeldByMe(ReplicationSlotControlLock))
+	{
+		took_control_lock = true;
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	}
+
 	slot->active_pid = 0;
 	slot->in_use = false;
-	LWLockRelease(ReplicationSlotControlLock);
+
+	if (took_control_lock)
+		LWLockRelease(ReplicationSlotControlLock);
 
 	/*
 	 * Slot is dead and doesn't prevent resource removal anymore, recompute
 	 * limits.
 	 */
-	ReplicationSlotsComputeRequiredXmin(false);
-	ReplicationSlotsComputeRequiredLSN();
+	ReplicationSlotsUpdateRequiredXmin(false);
+	ReplicationSlotsUpdateRequiredLSN();
 
 	/*
 	 * If removing the directory fails, the worst thing that will happen is
@@ -504,7 +593,8 @@ ReplicationSlotDropAcquired(void)
 	 * We release this at the very end, so that nobody starts trying to create
 	 * a slot while we're still cleaning up the detritus of the old one.
 	 */
-	LWLockRelease(ReplicationSlotAllocationLock);
+	if (took_allocation_lock)
+		LWLockRelease(ReplicationSlotAllocationLock);
 }
 
 /*
@@ -544,6 +634,9 @@ ReplicationSlotMarkDirty(void)
 /*
  * Convert a slot that's marked as RS_EPHEMERAL to a RS_PERSISTENT slot,
  * guaranteeing it will be there after an eventual crash.
+ *
+ * Failover slots will emit a create xlog record at this time, having
+ * not been previously written to xlog.
  */
 void
 ReplicationSlotPersist(void)
@@ -565,7 +658,7 @@ ReplicationSlotPersist(void)
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  */
 void
-ReplicationSlotsComputeRequiredXmin(bool already_locked)
+ReplicationSlotsUpdateRequiredXmin(bool already_locked)
 {
 	int			i;
 	TransactionId agg_xmin = InvalidTransactionId;
@@ -610,10 +703,20 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 }
 
 /*
- * Compute the oldest restart LSN across all slots and inform xlog module.
+ * Update the xlog module's copy of the minimum restart lsn across all slots
  */
 void
-ReplicationSlotsComputeRequiredLSN(void)
+ReplicationSlotsUpdateRequiredLSN(void)
+{
+	XLogSetReplicationSlotMinimumLSN(ReplicationSlotsComputeRequiredLSN(false));
+}
+
+/*
+ * Compute the oldest restart LSN across all slots (or optionally
+ * only failover slots) and return it.
+ */
+XLogRecPtr
+ReplicationSlotsComputeRequiredLSN(bool failover_only)
 {
 	int			i;
 	XLogRecPtr	min_required = InvalidXLogRecPtr;
@@ -625,14 +728,19 @@ ReplicationSlotsComputeRequiredLSN(void)
 	{
 		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
 		XLogRecPtr	restart_lsn;
+		bool		failover;
 
 		if (!s->in_use)
 			continue;
 
 		SpinLockAcquire(&s->mutex);
 		restart_lsn = s->data.restart_lsn;
+		failover = s->data.failover;
 		SpinLockRelease(&s->mutex);
 
+		if (failover_only && !failover)
+			continue;
+
 		if (restart_lsn != InvalidXLogRecPtr &&
 			(min_required == InvalidXLogRecPtr ||
 			 restart_lsn < min_required))
@@ -640,7 +748,7 @@ ReplicationSlotsComputeRequiredLSN(void)
 	}
 	LWLockRelease(ReplicationSlotControlLock);
 
-	XLogSetReplicationSlotMinimumLSN(min_required);
+	return min_required;
 }
 
 /*
@@ -649,7 +757,7 @@ ReplicationSlotsComputeRequiredLSN(void)
  * Returns InvalidXLogRecPtr if logical decoding is disabled or no logical
  * slots exist.
  *
- * NB: this returns a value >= ReplicationSlotsComputeRequiredLSN(), since it
+ * NB: this returns a value >= ReplicationSlotsComputeRequiredLSN(false), since it
  * ignores physical replication slots.
  *
  * The results aren't required frequently, so we don't maintain a precomputed
@@ -747,6 +855,45 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	Assert(MyReplicationSlot == NULL);
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		if (s->in_use && s->data.database == dboid)
+		{
+			/*
+			 * There should be no remaining connections to this
+			 * database, therefore any slot still attached to it
+			 * must be a logical, inactive failover slot. (Unused
+			 * slot array entries are excluded by the in_use test.)
+			 */
+			Assert(s->active_pid == 0);
+			Assert(SlotIsLogical(s));
+
+			/*
+			 * Acquire the replication slot
+			 */
+			MyReplicationSlot = s;
+
+			/*
+			 * No need to deactivate slot, especially since we
+			 * already hold ReplicationSlotControlLock.
+			 */
+			ReplicationSlotDropAcquired();
+		}
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	MyReplicationSlot = NULL;
+}
 
 /*
  * Check whether the server's configuration supports using replication
@@ -779,12 +926,13 @@ ReplicationSlotReserveWal(void)
 
 	Assert(slot != NULL);
 	Assert(slot->data.restart_lsn == InvalidXLogRecPtr);
+	Assert(slot->data.restart_tli == 0);
 
 	/*
 	 * The replication slot mechanism is used to prevent removal of required
 	 * WAL. As there is no interlock between this routine and checkpoints, WAL
 	 * segments could concurrently be removed when a now stale return value of
-	 * ReplicationSlotsComputeRequiredLSN() is used. In the unlikely case that
+	 * ReplicationSlotsUpdateRequiredLSN() is used. In the unlikely case that
 	 * this happens we'll just retry.
 	 */
 	while (true)
@@ -821,12 +969,12 @@ ReplicationSlotReserveWal(void)
 		}
 
 		/* prevent WAL removal as fast as possible */
-		ReplicationSlotsComputeRequiredLSN();
+		ReplicationSlotsUpdateRequiredLSN();
 
 		/*
 		 * If all required WAL is still there, great, otherwise retry. The
 		 * slot should prevent further removal of WAL, unless there's a
-		 * concurrent ReplicationSlotsComputeRequiredLSN() after we've written
+		 * concurrent ReplicationSlotsUpdateRequiredLSN() after we've written
 		 * the new restart_lsn above, so normally we should never need to loop
 		 * more than twice.
 		 */
@@ -878,7 +1026,7 @@ CheckPointReplicationSlots(void)
  * needs to be run before we start crash recovery.
  */
 void
-StartupReplicationSlots(void)
+StartupReplicationSlots(bool drop_nonfailover_slots)
 {
 	DIR		   *replication_dir;
 	struct dirent *replication_de;
@@ -917,7 +1065,7 @@ StartupReplicationSlots(void)
 		}
 
 		/* looks like a slot in a normal state, restore */
-		RestoreSlotFromDisk(replication_de->d_name);
+		RestoreSlotFromDisk(replication_de->d_name, drop_nonfailover_slots);
 	}
 	FreeDir(replication_dir);
 
@@ -926,8 +1074,8 @@ StartupReplicationSlots(void)
 		return;
 
 	/* Now that we have recovered all the data, compute replication xmin */
-	ReplicationSlotsComputeRequiredXmin(false);
-	ReplicationSlotsComputeRequiredLSN();
+	ReplicationSlotsUpdateRequiredXmin(false);
+	ReplicationSlotsUpdateRequiredLSN();
 }
 
 /* ----
@@ -996,6 +1144,8 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 
 /*
  * Shared functionality between saving and creating a replication slot.
+ *
+ * For failover slots this is where we emit xlog.
  */
 static void
 SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
@@ -1006,15 +1156,18 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	ReplicationSlotOnDisk cp;
 	bool		was_dirty;
 
-	/* first check whether there's something to write out */
-	SpinLockAcquire(&slot->mutex);
-	was_dirty = slot->dirty;
-	slot->just_dirtied = false;
-	SpinLockRelease(&slot->mutex);
+	if (!RecoveryInProgress())
+	{
+		/* first check whether there's something to write out */
+		SpinLockAcquire(&slot->mutex);
+		was_dirty = slot->dirty;
+		slot->just_dirtied = false;
+		SpinLockRelease(&slot->mutex);
 
-	/* and don't do anything if there's nothing to write */
-	if (!was_dirty)
-		return;
+		/* and don't do anything if there's nothing to write */
+		if (!was_dirty)
+			return;
+	}
 
 	LWLockAcquire(&slot->io_in_progress_lock, LW_EXCLUSIVE);
 
@@ -1047,6 +1200,25 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	SpinLockRelease(&slot->mutex);
 
+	/*
+	 * If needed, record this action in WAL
+	 */
+	if (slot->data.failover &&
+		slot->data.persistency == RS_PERSISTENT &&
+		!RecoveryInProgress())
+	{
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&cp.slotdata), sizeof(ReplicationSlotPersistentData));
+		/*
+		 * Note that slot creation on the downstream is also an "update".
+		 *
+		 * Slots can start off ephemeral and be updated to persistent. We just
+		 * log the update and the downstream creates the new slot if it doesn't
+		 * exist yet.
+		 */
+		(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_UPDATE);
+	}
+
 	COMP_CRC32C(cp.checksum,
 				(char *) (&cp) + SnapBuildOnDiskNotChecksummedSize,
 				SnapBuildOnDiskChecksummedSize);
@@ -1116,7 +1288,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
  * Load a single slot from disk into memory.
  */
 static void
-RestoreSlotFromDisk(const char *name)
+RestoreSlotFromDisk(const char *name, bool drop_nonfailover_slots)
 {
 	ReplicationSlotOnDisk cp;
 	int			i;
@@ -1235,10 +1407,21 @@ RestoreSlotFromDisk(const char *name)
 						path, checksum, cp.checksum)));
 
 	/*
-	 * If we crashed with an ephemeral slot active, don't restore but delete
-	 * it.
+	 * If we crashed with an ephemeral slot active, don't restore but
+	 * delete it.
+	 *
+	 * Similarly, if we're in archive recovery and will be running as
+	 * a standby (when drop_nonfailover_slots is set), non-failover
+	 * slots can't be relied upon. Logical slots might have a catalog
+	 * xmin lower than reality because the original slot on the master
+	 * advanced past the point the stale slot on the replica is stuck
+	 * at. Additionally slots might have been copied while being
+	 * written to if the basebackup copy method was not atomic.
+	 * Failover slots are safe since they're WAL-logged and follow the
+	 * master's slot position.
 	 */
-	if (cp.slotdata.persistency != RS_PERSISTENT)
+	if (cp.slotdata.persistency != RS_PERSISTENT
+			|| (drop_nonfailover_slots && !cp.slotdata.failover))
 	{
 		sprintf(path, "pg_replslot/%s", name);
 
@@ -1249,6 +1432,14 @@ RestoreSlotFromDisk(const char *name)
 					 errmsg("could not remove directory \"%s\"", path)));
 		}
 		fsync_fname("pg_replslot", true);
+
+		if (cp.slotdata.persistency == RS_PERSISTENT)
+		{
+			ereport(LOG,
+					(errmsg("dropped non-failover slot \"%s\" during archive recovery",
+							 NameStr(cp.slotdata.name))));
+		}
+
 		return;
 	}
 
@@ -1285,5 +1476,319 @@ RestoreSlotFromDisk(const char *name)
 	if (!restored)
 		ereport(PANIC,
 				(errmsg("too many replication slots active before shutdown"),
-				 errhint("Increase max_replication_slots and try again.")));
+				 errhint("Increase max_replication_slots (currently %d) and try again.",
+					 max_replication_slots)));
+}
+
+/*
+ * This usually just writes new persistent data to the slot state, but an
+ * update record might create a new slot on the downstream if we changed a
+ * previously ephemeral slot to persistent. We have to decide which
+ * by looking for the existing slot.
+ */
+static void
+ReplicationSlotRedoCreateOrUpdate(ReplicationSlotInWAL xlrec)
+{
+	ReplicationSlot *slot;
+	bool	found_available = false;
+	bool	found_duplicate = false;
+	int		use_slotid = 0;
+	int		i;
+
+	/*
+	 * We're in redo, but someone could still create a local
+	 * non-failover slot and race with us unless we take the
+	 * allocation lock.
+	 */
+	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Find first unused position in the slots array, but keep on
+		 * scanning in case there's an existing slot with the same
+		 * name.
+		 */
+		if (!slot->in_use && !found_available)
+		{
+			use_slotid = i;
+			found_available = true;
+		}
+
+		/*
+		 * Existing slot with same name? It could be our failover slot
+		 * to update or a non-failover slot with a conflicting name.
+		 */
+		if (strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) == 0)
+		{
+			use_slotid = i;
+			found_available = true;
+			found_duplicate = true;
+			break;
+		}
+	}
+
+	if (found_duplicate && !slot->data.failover)
+	{
+		/*
+		 * A local non-failover slot exists with the same name as
+		 * the failover slot we're creating.
+		 *
+		 * Clobber the client, drop its slot, and carry on with
+		 * our business.
+		 *
+		 * First we must temporarily release the allocation lock while
+		 * we try to terminate the process that holds the slot, since
+		 * we don't want to hold the LWlock for ages. We'll reacquire
+		 * it later.
+		 */
+		LWLockRelease(ReplicationSlotAllocationLock);
+
+		/* We might race with other clients, so retry-loop */
+		do
+		{
+			int active_pid = slot->active_pid;
+			int max_sleep_micros = 120 * 1000000;	/* give up after 120s */
+			int micros_per_sleep = 100000;	/* poll every 100ms; usleep() args must be < 1s */
+
+			if (active_pid != 0)
+			{
+				ereport(INFO,
+						(errmsg("terminating active connection by PID %d to local slot \"%s\" because of conflict with recovery",
+							active_pid, NameStr(slot->data.name))));
+
+				if (kill(active_pid, SIGTERM))
+					elog(DEBUG1, "failed to signal PID %d to terminate on slot conflict: %m",
+							active_pid);
+
+				/*
+				 * No way to wait for the process since it's not a child
+				 * of ours and there's no latch to set, so poll.
+				 *
+				 * We're checking this without any locks held, but
+				 * we'll recheck when we attempt to drop the slot.
+				 */
+				while (slot->in_use && slot->active_pid == active_pid
+						&& max_sleep_micros > 0)
+				{
+					usleep(micros_per_sleep);
+					max_sleep_micros -= micros_per_sleep;
+				}
+
+				if (max_sleep_micros <= 0)
+					elog(WARNING, "process %d is taking too long to terminate after SIGTERM",
+							active_pid);
+			}
+
+			if (active_pid == 0)
+			{
+				/* Try to acquire and drop the slot */
+				SpinLockAcquire(&slot->mutex);
+
+				if (slot->active_pid != 0)
+				{
+					/* Lost the race, go around */
+				}
+				else
+				{
+					/* Claim the slot for ourselves */
+					slot->active_pid = MyProcPid;
+					MyReplicationSlot = slot;
+				}
+				SpinLockRelease(&slot->mutex);
+			}
+
+			if (slot->active_pid == MyProcPid)
+			{
+				NameData slotname;
+				strncpy(NameStr(slotname), NameStr(slot->data.name), NAMEDATALEN);
+				(NameStr(slotname))[NAMEDATALEN-1] = '\0';
+
+				/*
+				 * Reclaim the allocation lock and THEN drop the slot,
+				 * so nobody else can grab the name until we've
+				 * finished redo.
+				 */
+				LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+				ReplicationSlotDropAcquired();
+				/* We clobbered the duplicate, treat it as new */
+				found_duplicate = false;
+
+				ereport(WARNING,
+						(errmsg("dropped local replication slot \"%s\" because of conflict with recovery",
+								NameStr(slotname)),
+						 errdetail("A failover slot with the same name was created on the master server.")));
+			}
+		}
+		while (slot->in_use);
+	}
+
+	Assert(LWLockHeldByMe(ReplicationSlotAllocationLock));
+
+	/*
+	 * This is either an empty slot control position to make a new slot or it's
+	 * an existing entry for this failover slot that we need to update.
+	 */
+	if (found_available)
+	{
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+
+		slot = &ReplicationSlotCtl->replication_slots[use_slotid];
+
+		/* restore the entire set of persistent data */
+		memcpy(&slot->data, xlrec,
+			   sizeof(ReplicationSlotPersistentData));
+
+		Assert(strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) == 0);
+		Assert(slot->data.failover && slot->data.persistency == RS_PERSISTENT);
+
+		/* Update the non-persistent in-memory state */
+		slot->effective_xmin = xlrec->xmin;
+		slot->effective_catalog_xmin = xlrec->catalog_xmin;
+
+		if (found_duplicate)
+		{
+			char		path[MAXPGPATH];
+
+			/* Write an existing slot to disk */
+			Assert(slot->in_use);
+			Assert(slot->active_pid == 0); /* can't be replaying from failover slot */
+
+			sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
+			slot->dirty = true;
+			SaveSlotToPath(slot, path, ERROR);
+		}
+		else
+		{
+			Assert(!slot->in_use);
+			/* In-memory state that's only set on create, not update */
+			slot->active_pid = 0;
+			slot->in_use = true;
+			slot->candidate_catalog_xmin = InvalidTransactionId;
+			slot->candidate_xmin_lsn = InvalidXLogRecPtr;
+			slot->candidate_restart_lsn = InvalidXLogRecPtr;
+			slot->candidate_restart_valid = InvalidXLogRecPtr;
+
+			CreateSlotOnDisk(slot);
+		}
+
+		LWLockRelease(ReplicationSlotControlLock);
+
+		ReplicationSlotsUpdateRequiredXmin(false);
+		ReplicationSlotsUpdateRequiredLSN();
+	}
+
+	LWLockRelease(ReplicationSlotAllocationLock);
+
+	if (!found_available)
+	{
+		/*
+		 * Because the standby should have max_replication_slots at least as
+		 * large as the master's, this shouldn't happen, but just in case...
+		 */
+		ereport(ERROR,
+				(errmsg("max_replication_slots exceeded, cannot replay failover slot creation"),
+				 errhint("Increase max_replication_slots.")));
+	}
+}
+
+/*
+ * Redo a slot drop of a failover slot. This might be a redo during crash
+ * recovery on the master or it may be replay on a standby.
+ */
+static void
+ReplicationSlotRedoDrop(const char * slotname)
+{
+	/*
+	 * Acquire the failover slot that's to be dropped.
+	 *
+	 * We can't ReplicationSlotAcquire here because we want to acquire
+	 * a replication slot during replay, which isn't usually allowed.
+	 * Also, because we might crash midway through a drop we can't
+	 * assume we'll actually find the slot so it's not an error for
+	 * the slot to be missing.
+	 */
+	int			i;
+
+	Assert(MyReplicationSlot == NULL);
+
+	ReplicationSlotValidateName(slotname, ERROR);
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		if (s->in_use && strcmp(slotname, NameStr(s->data.name)) == 0)
+		{
+			if (s->data.persistency != RS_PERSISTENT)
+			{
+				/* shouldn't happen */
+				elog(WARNING, "found conflicting non-persistent slot during failover slot drop");
+				break;
+			}
+
+			if (!s->data.failover)
+			{
+				/* shouldn't happen */
+				elog(WARNING, "found non-failover slot during redo of slot drop");
+				break;
+			}
+
+			/* A failover slot can't be active during recovery */
+			Assert(s->active_pid == 0);
+
+			/* Claim the slot */
+			s->active_pid = MyProcPid;
+			MyReplicationSlot = s;
+
+			break;
+		}
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	if (MyReplicationSlot != NULL)
+	{
+		ReplicationSlotDropAcquired();
+	}
+	else
+	{
+		elog(WARNING, "failover slot \"%s\" not found during redo of drop",
+				slotname);
+	}
+}
+
+void
+replslot_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		/*
+		 * Update the values for an existing failover slot or, when a slot
+		 * is first logged as persistent, create it on the downstream.
+		 */
+		case XLOG_REPLSLOT_UPDATE:
+			ReplicationSlotRedoCreateOrUpdate((ReplicationSlotInWAL) XLogRecGetData(record));
+			break;
+
+		/*
+		 * Drop an existing failover slot.
+		 */
+		case XLOG_REPLSLOT_DROP:
+			{
+				xl_replslot_drop *xlrec =
+				(xl_replslot_drop *) XLogRecGetData(record);
+
+				ReplicationSlotRedoDrop(NameStr(xlrec->name));
+
+				break;
+			}
+
+		default:
+			elog(PANIC, "replslot_redo: unknown op code %u", info);
+	}
 }
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 9cc24ea..f430714 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -57,7 +57,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 	CheckSlotRequirements();
 
 	/* acquire replication slot, this will check for conflicting names */
-	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT);
+	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT, false);
 
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	nulls[0] = false;
@@ -120,7 +120,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	 * errors during initialization because it'll get dropped if this
 	 * transaction fails. We'll make it persistent at the end.
 	 */
-	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL);
+	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL, false);
 
 	/*
 	 * Create logical decoding context, to build the initial snapshot.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c03e045..1583862 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -792,7 +792,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	if (cmd->kind == REPLICATION_KIND_PHYSICAL)
 	{
-		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT);
+		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT, false);
 	}
 	else
 	{
@@ -803,7 +803,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * handle errors during initialization because it'll get dropped if
 		 * this transaction fails. We'll make it persistent at the end.
 		 */
-		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL);
+		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL, false);
 	}
 
 	initStringInfo(&output_message);
@@ -1523,7 +1523,7 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	if (changed)
 	{
 		ReplicationSlotMarkDirty();
-		ReplicationSlotsComputeRequiredLSN();
+		ReplicationSlotsUpdateRequiredLSN();
 	}
 
 	/*
@@ -1619,7 +1619,7 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 	if (changed)
 	{
 		ReplicationSlotMarkDirty();
-		ReplicationSlotsComputeRequiredXmin(false);
+		ReplicationSlotsUpdateRequiredXmin(false);
 	}
 }
 
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index f9cd395..73ed7d4 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -26,6 +26,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "rmgrdesc.h"
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index fab912d..124b7e5 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -45,3 +45,4 @@ PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_start
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
+PG_RMGR(RM_REPLSLOT_ID, "ReplicationSlot", replslot_redo, replslot_desc, replslot_identify, NULL, NULL)
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8be8ab6..cdcbd37 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -11,69 +11,12 @@
 
 #include "fmgr.h"
 #include "access/xlog.h"
-#include "access/xlogreader.h"
+#include "replication/slot_xlog.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
 
 /*
- * Behaviour of replication slots, upon release or crash.
- *
- * Slots marked as PERSISTENT are crashsafe and will not be dropped when
- * released. Slots marked as EPHEMERAL will be dropped when released or after
- * restarts.
- *
- * EPHEMERAL slots can be made PERSISTENT by calling ReplicationSlotPersist().
- */
-typedef enum ReplicationSlotPersistency
-{
-	RS_PERSISTENT,
-	RS_EPHEMERAL
-} ReplicationSlotPersistency;
-
-/*
- * On-Disk data of a replication slot, preserved across restarts.
- */
-typedef struct ReplicationSlotPersistentData
-{
-	/* The slot's identifier */
-	NameData	name;
-
-	/* database the slot is active on */
-	Oid			database;
-
-	/*
-	 * The slot's behaviour when being dropped (or restored after a crash).
-	 */
-	ReplicationSlotPersistency persistency;
-
-	/*
-	 * xmin horizon for data
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId xmin;
-
-	/*
-	 * xmin horizon for catalog tuples
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId catalog_xmin;
-
-	/* oldest LSN that might be required by this replication slot */
-	XLogRecPtr	restart_lsn;
-
-	/* oldest LSN that the client has acked receipt for */
-	XLogRecPtr	confirmed_flush;
-
-	/* plugin name */
-	NameData	plugin;
-} ReplicationSlotPersistentData;
-
-/*
  * Shared memory state of a single replication slot.
  */
 typedef struct ReplicationSlot
@@ -155,7 +98,7 @@ extern void ReplicationSlotsShmemInit(void);
 
 /* management of individual slots */
 extern void ReplicationSlotCreate(const char *name, bool db_specific,
-					  ReplicationSlotPersistency p);
+					  ReplicationSlotPersistency p, bool failover);
 extern void ReplicationSlotPersist(void);
 extern void ReplicationSlotDrop(const char *name);
 
@@ -167,12 +110,14 @@ extern void ReplicationSlotMarkDirty(void);
 /* misc stuff */
 extern bool ReplicationSlotValidateName(const char *name, int elevel);
 extern void ReplicationSlotReserveWal(void);
-extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
-extern void ReplicationSlotsComputeRequiredLSN(void);
+extern void ReplicationSlotsUpdateRequiredXmin(bool already_locked);
+extern void ReplicationSlotsUpdateRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
+extern XLogRecPtr ReplicationSlotsComputeRequiredLSN(bool failover_only);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
-extern void StartupReplicationSlots(void);
+extern void StartupReplicationSlots(bool drop_nonfailover_slots);
 extern void CheckPointReplicationSlots(void);
 
 extern void CheckSlotRequirements(void);
-- 
2.1.0
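
For reviewers who want to try the WIP out, the intended usage from SQL looks roughly like the sketch below. It relies on the new optional `failover` argument and the `failover` column added by the UI patch that follows; names and signatures are taken from this patch series and may well change before commit.

```sql
-- On the master: create a logical failover slot. The third argument
-- ('failover') is the new one; the slot's persistent state is WAL-logged,
-- so each standby maintains a copy that becomes usable after promotion.
SELECT * FROM pg_create_logical_replication_slot(
    'decoding_slot', 'test_decoding', true);

-- The new column in pg_replication_slots reports failover status.
SELECT slot_name, slot_type, failover FROM pg_replication_slots;

-- On a standby the slot exists but is not accessible until promotion.
-- After pg_ctl promote, decoding can resume from the same slot name:
SELECT * FROM pg_logical_slot_get_changes('decoding_slot', NULL, NULL);
```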

0003-Add-the-UI-and-documentation-for-failover-slots.patchtext/x-patch; charset=US-ASCII; name=0003-Add-the-UI-and-documentation-for-failover-slots.patchDownload
From bce33b0732e0498b25d6673b49b86c1eb09ab894 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 15 Feb 2016 12:00:59 +0800
Subject: [PATCH 3/4] Add the UI and documentation for failover slots

Expose failover slots to the user.

Add a new 'failover' argument to pg_create_logical_replication_slot and
pg_create_physical_replication_slot . Report if a slot is a failover
slot in pg_catalog.pg_replication_slots. Accept a new FAILOVER keyword
argument in CREATE_REPLICATION_SLOT on the walsender protocol.

Document the existence of failover slots support and how to use them.
---
 contrib/test_decoding/expected/ddl.out | 41 ++++++++++++++++++---
 contrib/test_decoding/sql/ddl.sql      | 17 ++++++++-
 doc/src/sgml/catalogs.sgml             | 10 +++++
 doc/src/sgml/func.sgml                 | 24 ++++++++----
 doc/src/sgml/high-availability.sgml    | 67 ++++++++++++++++++++++++++++++++--
 doc/src/sgml/logicaldecoding.sgml      | 52 +++++++++++++++++---------
 doc/src/sgml/protocol.sgml             | 24 ++++++++++--
 src/backend/catalog/system_views.sql   | 12 +++++-
 src/backend/replication/repl_gram.y    | 13 ++++++-
 src/backend/replication/slotfuncs.c    | 13 +++++--
 src/backend/replication/walsender.c    |  4 +-
 src/include/catalog/pg_proc.h          |  6 +--
 src/include/nodes/replnodes.h          |  1 +
 src/include/replication/slot.h         |  1 +
 src/test/regress/expected/rules.out    |  3 +-
 15 files changed, 237 insertions(+), 51 deletions(-)

diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out
index 57a1289..5b2f34a 100644
--- a/contrib/test_decoding/expected/ddl.out
+++ b/contrib/test_decoding/expected/ddl.out
@@ -9,6 +9,9 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
 -- fail because of an already existing slot
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 ERROR:  replication slot "regression_slot" already exists
+-- fail because a failover slot can't replace a normal slot on the master
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding', true);
+ERROR:  replication slot "regression_slot" already exists
 -- fail because of an invalid name
 SELECT 'init' FROM pg_create_logical_replication_slot('Invalid Name', 'test_decoding');
 ERROR:  replication slot name "Invalid Name" contains invalid character
@@ -58,11 +61,37 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
 SELECT slot_name, plugin, slot_type, active,
     NOT catalog_xmin IS NULL AS catalog_xmin_set,
     xmin IS NULl  AS data_xmin_not_set,
-    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
 FROM pg_replication_slots;
-    slot_name    |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal 
------------------+---------------+-----------+--------+------------------+-------------------+----------
- regression_slot | test_decoding | logical   | f      | t                | t                 | t
+    slot_name    |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal | failover 
+-----------------+---------------+-----------+--------+------------------+-------------------+----------+----------
+ regression_slot | test_decoding | logical   | f      | t                | t                 | t        | f
+(1 row)
+
+/* same for a failover slot */
+SELECT 'init' FROM pg_create_logical_replication_slot('failover_slot', 'test_decoding', true);
+ ?column? 
+----------
+ init
+(1 row)
+
+SELECT slot_name, plugin, slot_type, active,
+    NOT catalog_xmin IS NULL AS catalog_xmin_set,
+    xmin IS NULl  AS data_xmin_not_set,
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
+FROM pg_replication_slots
+WHERE slot_name = 'failover_slot';
+   slot_name   |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal | failover 
+---------------+---------------+-----------+--------+------------------+-------------------+----------+----------
+ failover_slot | test_decoding | logical   | f      | t                | t                 | t        | t
+(1 row)
+
+SELECT pg_drop_replication_slot('failover_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
 (1 row)
 
 /*
@@ -673,7 +702,7 @@ SELECT pg_drop_replication_slot('regression_slot');
 
 /* check that the slot is gone */
 SELECT * FROM pg_replication_slots;
- slot_name | plugin | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
+ slot_name | plugin | slot_type | datoid | database | active | active_pid | failover | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
+-----------+--------+-----------+--------+----------+--------+------------+----------+------+--------------+-------------+---------------------
 (0 rows)
 
diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql
index e311c59..f64b21c 100644
--- a/contrib/test_decoding/sql/ddl.sql
+++ b/contrib/test_decoding/sql/ddl.sql
@@ -4,6 +4,8 @@ SET synchronous_commit = on;
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 -- fail because of an already existing slot
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+-- fail because a failover slot can't replace a normal slot on the master
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding', true);
 -- fail because of an invalid name
 SELECT 'init' FROM pg_create_logical_replication_slot('Invalid Name', 'test_decoding');
 
@@ -22,16 +24,27 @@ SELECT 'init' FROM pg_create_physical_replication_slot('repl');
 SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 SELECT pg_drop_replication_slot('repl');
 
-
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 
 /* check whether status function reports us, only reproduceable columns */
 SELECT slot_name, plugin, slot_type, active,
     NOT catalog_xmin IS NULL AS catalog_xmin_set,
     xmin IS NULl  AS data_xmin_not_set,
-    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
 FROM pg_replication_slots;
 
+/* same for a failover slot */
+SELECT 'init' FROM pg_create_logical_replication_slot('failover_slot', 'test_decoding', true);
+SELECT slot_name, plugin, slot_type, active,
+    NOT catalog_xmin IS NULL AS catalog_xmin_set,
+    xmin IS NULl  AS data_xmin_not_set,
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
+FROM pg_replication_slots
+WHERE slot_name = 'failover_slot';
+SELECT pg_drop_replication_slot('failover_slot');
+
 /*
  * Check that changes are handled correctly when interleaved with ddl
  */
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 412c845..053b91a 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -5377,6 +5377,16 @@
      </row>
 
      <row>
+      <entry><structfield>failover</structfield></entry>
+      <entry><type>boolean</type></entry>
+      <entry></entry>
+      <entry>
+       True if this slot is a failover slot; see
+       <xref linkend="streaming-replication-slots-failover">.
+      </entry>
+     </row>
+
+     <row>
       <entry><structfield>xmin</structfield></entry>
       <entry><type>xid</type></entry>
       <entry></entry>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f9eea76..ef49bd7 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17420,7 +17420,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         <indexterm>
          <primary>pg_create_physical_replication_slot</primary>
         </indexterm>
-        <literal><function>pg_create_physical_replication_slot(<parameter>slot_name</parameter> <type>name</type> <optional>, <parameter>immediately_reserve</> <type>boolean</> </optional>)</function></literal>
+        <literal><function>pg_create_physical_replication_slot(<parameter>slot_name</parameter> <type>name</type>, <optional><parameter>immediately_reserve</> <type>boolean</></optional>, <optional><parameter>failover</> <type>boolean</></optional>)</function></literal>
        </entry>
        <entry>
         (<parameter>slot_name</parameter> <type>name</type>, <parameter>xlog_position</parameter> <type>pg_lsn</type>)
@@ -17431,7 +17431,10 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         when <literal>true</>, specifies that the <acronym>LSN</> for this
         replication slot be reserved immediately; otherwise
         the <acronym>LSN</> is reserved on first connection from a streaming
-        replication client. Streaming changes from a physical slot is only
+        replication client. If <literal>failover</literal> is <literal>true</literal>
+        then the slot is created as a failover slot; see <xref
+        linkend="streaming-replication-slots-failover">.
+        Streaming changes from a physical slot is only
         possible with the streaming-replication protocol &mdash;
         see <xref linkend="protocol-replication">. This function corresponds
         to the replication protocol command <literal>CREATE_REPLICATION_SLOT
@@ -17460,7 +17463,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         <indexterm>
          <primary>pg_create_logical_replication_slot</primary>
         </indexterm>
-        <literal><function>pg_create_logical_replication_slot(<parameter>slot_name</parameter> <type>name</type>, <parameter>plugin</parameter> <type>name</type>)</function></literal>
+        <literal><function>pg_create_logical_replication_slot(<parameter>slot_name</parameter> <type>name</type>, <parameter>plugin</parameter> <type>name</type>, <optional><parameter>failover</> <type>boolean</></optional>)</function></literal>
        </entry>
        <entry>
         (<parameter>slot_name</parameter> <type>name</type>, <parameter>xlog_position</parameter> <type>pg_lsn</type>)
@@ -17468,8 +17471,10 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
        <entry>
         Creates a new logical (decoding) replication slot named
         <parameter>slot_name</parameter> using the output plugin
-        <parameter>plugin</parameter>.  A call to this function has the same
-        effect as the replication protocol command
+        <parameter>plugin</parameter>. If <literal>failover</literal>
+        is <literal>true</literal> the slot is created as a failover
+        slot; see <xref linkend="streaming-replication-slots-failover">. A call to
+        this function has the same effect as the replication protocol command
         <literal>CREATE_REPLICATION_SLOT ... LOGICAL</literal>.
        </entry>
       </row>
@@ -17485,7 +17490,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         (<parameter>location</parameter> <type>pg_lsn</type>, <parameter>xid</parameter> <type>xid</type>, <parameter>data</parameter> <type>text</type>)
        </entry>
        <entry>
-        Returns changes in the slot <parameter>slot_name</parameter>, starting
+        Returns changes in the slot <parameter>slot_name</parameter>, starting
         from the point at which since changes have been consumed last.  If
         <parameter>upto_lsn</> and <parameter>upto_nchanges</> are NULL,
         logical decoding will continue until end of WAL.  If
@@ -17495,7 +17500,12 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         stop when the number of rows produced by decoding exceeds
         the specified value.  Note, however, that the actual number of
         rows returned may be larger, since this limit is only checked after
-        adding the rows produced when decoding each new transaction commit.
+        adding the rows produced when decoding each new transaction commit,
+        so at least one transaction is always returned. The returned changes
+        are consumed and will not be returned by subsequent calls to
+        <function>pg_logical_slot_get_changes</function>, though a server
+        crash may cause recently consumed changes to be replayed again after
+        recovery.
        </entry>
       </row>
 
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 6cb690c..1624d51 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -859,7 +859,8 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
      <xref linkend="functions-recovery-info-table"> for details).
      The last WAL receive location in the standby is also displayed in the
      process status of the WAL receiver process, displayed using the
-     <command>ps</> command (see <xref linkend="monitoring-ps"> for details).
+     <command>ps</> command (see <xref linkend="monitoring-ps"> for details)
+     and in the <literal>pg_stat_replication</literal> view.
     </para>
     <para>
      You can retrieve a list of WAL sender processes via the
@@ -871,10 +872,15 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
      <function>pg_last_xlog_receive_location</> on the standby might indicate
      network delay, or that the standby is under heavy load.
     </para>
+    <para>
+     Compare xlog locations and measure lag using
+     <link linkend="functions-admin-backup"><function>pg_xlog_location_diff(...)</function></link>
+     in a query over <literal>pg_stat_replication</literal>.
+    </para>
    </sect3>
   </sect2>
 
-  <sect2 id="streaming-replication-slots">
+  <sect2 id="streaming-replication-slots" xreflabel="Replication slots">
    <title>Replication Slots</title>
    <indexterm>
     <primary>replication slot</primary>
@@ -885,7 +891,10 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
     not remove WAL segments until they have been received by all standbys,
     and that the master does not remove rows which could cause a
     <link linkend="hot-standby-conflict">recovery conflict</> even when the
-    standby is disconnected.
+    standby is disconnected. They allow clients to receive a stream of
+    changes in an ordered, consistent manner - either raw WAL from a physical
+    replication slot or a logical stream of row changes from a
+    <link linkend="logicaldecoding-slots">logical replication slot</link>.
    </para>
    <para>
     In lieu of using replication slots, it is possible to prevent the removal
@@ -906,6 +915,17 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
     and the latter often needs to be set to a high value to provide adequate
     protection.  Replication slots overcome these disadvantages.
    </para>
+   <para>
+    Because replication slots cause the server to retain transaction logs
+    in <filename>pg_xlog</filename> it is important to monitor how far slots
+    are lagging behind the master in order to prevent the disk from filling
+    up and interrupting the master's operation. A query like:
+    <programlisting>
+      SELECT *, pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn) AS lag_bytes
+      FROM pg_replication_slots;
+    </programlisting>
+    will provide an indication of how far a slot is lagging.
+   </para>
    <sect3 id="streaming-replication-slots-manipulation">
     <title>Querying and manipulating replication slots</title>
     <para>
@@ -949,6 +969,47 @@ primary_slot_name = 'node_a_slot'
 </programlisting>
     </para>
    </sect3>
+
+   <sect3 id="streaming-replication-slots-failover" xreflabel="Failover slots">
+     <title>Failover slots</title>
+
+     <para>
+      Normally a replication slot is not preserved across backup and restore
+      (such as by <application>pg_basebackup</application>) and is not
+      replicated to standbys. Slots are <emphasis>automatically
+      dropped</emphasis> when starting up as a streaming replica or in archive
+      recovery (PITR) mode.
+     </para>
+
+     <para>
+      To make it possible for an application to consistently follow
+      failover when a replica is promoted to a new master a slot may be
+      created as a <emphasis>failover slot</emphasis>. A failover slot may
+      only be created, replayed from or dropped on a master server. Changes to
+      the slot are written to WAL and replicated to standbys. When a standby
+      is promoted applications may connect to the slot on the standby and
+      resume replay from it at a consistent point, as if it were the original
+      master. Failover slots may not be used to replay from a standby before
+      promotion.
+     </para>
+
+     <para>
+      Non-failover slots may be created on and used from a replica. This is
+      currently limited to physical slots as logical decoding is not supported
+      on replica servers.
+     </para>
+
+     <para>
+      When a failover slot created on the master has the same name as a
+      non-failover slot on a replica server, the non-failover slot will be
+      automatically dropped. Any client currently connected will be
+      disconnected with an error indicating a conflict with recovery. It
+      is strongly recommended that you avoid creating failover slots with
+      the same name as slots on replicas.
+     </para>
+
+   </sect3>
+
   </sect2>
 
   <sect2 id="cascading-replication">
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index e841348..7f6a73d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -12,15 +12,17 @@
 
   <para>
    Changes are sent out in streams identified by logical replication slots.
-   Each stream outputs each change exactly once.
+   Each stream outputs each change once, though repeats are possible after a
+   server crash.
   </para>
 
   <para>
    The format in which those changes are streamed is determined by the output
-   plugin used.  An example plugin is provided in the PostgreSQL distribution.
-   Additional plugins can be
-   written to extend the choice of available formats without modifying any
-   core code.
+   plugin used.  An example plugin (test_decoding) is provided in the
+   PostgreSQL distribution.  Additional plugins can be written to extend the
+   choice of available formats without modifying any core code.
+  </para>
+  <para>
    Every output plugin has access to each individual new row produced
    by <command>INSERT</command> and the new row version created
    by <command>UPDATE</command>.  Availability of old row versions for
@@ -192,7 +194,7 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
     </para>
    </sect2>
 
-   <sect2>
+   <sect2 id="logicaldecoding-slots" xreflabel="Logical Replication Slots">
     <title>Replication Slots</title>
 
     <indexterm>
@@ -201,20 +203,18 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
     </indexterm>
 
     <para>
+     The general concepts of replication slots are discussed under
+     <xref linkend="streaming-replication-slots">. This section covers only
+     specifics for logical slots.
+    </para>
+
+    <para>
      In the context of logical replication, a slot represents a stream of
      changes that can be replayed to a client in the order they were made on
      the origin server. Each slot streams a sequence of changes from a single
-     database, sending each change exactly once (except when peeking forward
-     in the stream).
+     database, sending each change only once.
     </para>
 
-    <note>
-     <para><productname>PostgreSQL</productname> also has streaming replication slots
-     (see <xref linkend="streaming-replication">), but they are used somewhat
-     differently there.
-     </para>
-    </note>
-
     <para>
      A replication slot has an identifier that is unique across all databases
      in a <productname>PostgreSQL</productname> cluster. Slots persist
@@ -243,9 +243,22 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
       even when there is no connection using them. This consumes storage
       because neither required WAL nor required rows from the system catalogs
       can be removed by <command>VACUUM</command> as long as they are required by a replication
-      slot.  So if a slot is no longer required it should be dropped.
+      slot.  If a slot is no longer required it should be dropped to prevent
+      <filename>pg_xlog</filename> from filling up and (for logical slots)
+      the system catalogs from bloating.
      </para>
     </note>
+
+    <para>
+     A replication slot keeps track of the oldest needed WAL position that the
+     application may need to restart replay from. It does <emphasis>not</emphasis>
+     guarantee never to replay the same data twice, and updates to the restart
+     position are not immediately flushed to disk, so they may be lost if the
+     server crashes. The client application is responsible for keeping track of
+     the point it has replayed up to and should request that replay restart at
+     that point when it reconnects by passing the last-replayed LSN to the
+     start replication command.
+    </para>
    </sect2>
 
    <sect2>
@@ -268,7 +281,10 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
      SNAPSHOT</literal></link> to read the state of the database at the moment
      the slot was created. This transaction can then be used to dump the
      database's state at that point in time, which afterwards can be updated
-     using the slot's contents without losing any changes.
+     using the slot's contents without losing any changes. The exported snapshot
+     remains valid until the connection that created it runs another command
+     or disconnects. It may be imported into another connection and re-exported
+     to preserve it longer.
     </para>
    </sect2>
   </sect1>
@@ -280,7 +296,7 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
     The commands
     <itemizedlist>
      <listitem>
-      <para><literal>CREATE_REPLICATION_SLOT <replaceable>slot_name</replaceable> LOGICAL <replaceable>output_plugin</replaceable></literal></para>
+      <para><literal>CREATE_REPLICATION_SLOT <replaceable>slot_name</replaceable> LOGICAL <replaceable>output_plugin</replaceable> <optional>FAILOVER</optional></literal></para>
      </listitem>
 
      <listitem>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 1a596cd..cbf523d 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1434,13 +1434,14 @@ The commands accepted in walsender mode are:
   </varlistentry>
 
   <varlistentry>
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> { <literal>PHYSICAL</> [ <literal>RESERVE_WAL</> ] | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> { <literal>PHYSICAL</> <optional><literal>RESERVE_WAL</></> | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> } <optional><literal>FAILOVER</></>
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
      <para>
       Create a physical or logical replication
-      slot. See <xref linkend="streaming-replication-slots"> for more about
+      slot. See <xref linkend="streaming-replication-slots"> and
+      <xref linkend="logicaldecoding-slots"> for more about
       replication slots.
      </para>
      <variablelist>
@@ -1468,12 +1469,23 @@ The commands accepted in walsender mode are:
        <term><literal>RESERVE_WAL</></term>
        <listitem>
         <para>
-         Specify that this physical replication reserves <acronym>WAL</>
+         Specify that this physical replication slot reserves <acronym>WAL</>
          immediately.  Otherwise, <acronym>WAL</> is only reserved upon
          connection from a streaming replication client.
         </para>
        </listitem>
       </varlistentry>
+
+      <varlistentry>
+       <term><literal>FAILOVER</></term>
+       <listitem>
+        <para>
+         Create this slot as a <link linkend="streaming-replication-slots-failover">
+         failover slot</link>.
+        </para>
+       </listitem>
+      </varlistentry>
+
      </variablelist>
     </listitem>
   </varlistentry>
@@ -1829,6 +1841,12 @@ The commands accepted in walsender mode are:
       to process the output for streaming.
      </para>
 
+     <para>
+      Logical replication automatically follows timeline switches. It is
+      not necessary or possible to supply a <literal>TIMELINE</literal>
+      option as in physical replication.
+     </para>
+
      <variablelist>
       <varlistentry>
        <term><literal>SLOT</literal> <replaceable class="parameter">slot_name</></term>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 923fe58..b4f8fbe 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -698,6 +698,7 @@ CREATE VIEW pg_replication_slots AS
             D.datname AS database,
             L.active,
             L.active_pid,
+            L.failover,
             L.xmin,
             L.catalog_xmin,
             L.restart_lsn,
@@ -943,12 +944,21 @@ AS 'pg_logical_slot_peek_binary_changes';
 
 CREATE OR REPLACE FUNCTION pg_create_physical_replication_slot(
     IN slot_name name, IN immediately_reserve boolean DEFAULT false,
-    OUT slot_name name, OUT xlog_position pg_lsn)
+    IN failover boolean DEFAULT false, OUT slot_name name,
+    OUT xlog_position pg_lsn)
 RETURNS RECORD
 LANGUAGE INTERNAL
 STRICT VOLATILE
 AS 'pg_create_physical_replication_slot';
 
+CREATE OR REPLACE FUNCTION pg_create_logical_replication_slot(
+    IN slot_name name, IN plugin name, IN failover boolean DEFAULT false,
+    OUT slot_name text, OUT xlog_position pg_lsn)
+RETURNS RECORD
+LANGUAGE INTERNAL
+STRICT VOLATILE
+AS 'pg_create_logical_replication_slot';
+
 CREATE OR REPLACE FUNCTION
   make_interval(years int4 DEFAULT 0, months int4 DEFAULT 0, weeks int4 DEFAULT 0,
                 days int4 DEFAULT 0, hours int4 DEFAULT 0, mins int4 DEFAULT 0,
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index d93db88..1574f24 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -77,6 +77,7 @@ Node *replication_parse_result;
 %token K_LOGICAL
 %token K_SLOT
 %token K_RESERVE_WAL
+%token K_FAILOVER
 
 %type <node>	command
 %type <node>	base_backup start_replication start_logical_replication
@@ -90,6 +91,7 @@ Node *replication_parse_result;
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot
 %type <boolval>	opt_reserve_wal
+%type <boolval> opt_failover
 
 %%
 
@@ -184,23 +186,25 @@ base_backup_opt:
 
 create_replication_slot:
 			/* CREATE_REPLICATION_SLOT slot PHYSICAL RESERVE_WAL */
-			K_CREATE_REPLICATION_SLOT IDENT K_PHYSICAL opt_reserve_wal
+			K_CREATE_REPLICATION_SLOT IDENT K_PHYSICAL opt_reserve_wal opt_failover
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_PHYSICAL;
 					cmd->slotname = $2;
 					cmd->reserve_wal = $4;
+					cmd->failover = $5;
 					$$ = (Node *) cmd;
 				}
 			/* CREATE_REPLICATION_SLOT slot LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT K_LOGICAL IDENT
+			| K_CREATE_REPLICATION_SLOT IDENT K_LOGICAL IDENT opt_failover
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->plugin = $4;
+					cmd->failover = $5;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -276,6 +280,11 @@ opt_reserve_wal:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_failover:
+			K_FAILOVER						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index f430714..abc450d 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -18,6 +18,7 @@
 
 #include "access/htup_details.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "replication/logical.h"
 #include "replication/logicalfuncs.h"
 #include "utils/builtins.h"
@@ -41,6 +42,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 {
 	Name		name = PG_GETARG_NAME(0);
 	bool 		immediately_reserve = PG_GETARG_BOOL(1);
+	bool		failover = PG_GETARG_BOOL(2);
 	Datum		values[2];
 	bool		nulls[2];
 	TupleDesc	tupdesc;
@@ -57,7 +59,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 	CheckSlotRequirements();
 
 	/* acquire replication slot, this will check for conflicting names */
-	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT, false);
+	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT, failover);
 
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	nulls[0] = false;
@@ -96,6 +98,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 {
 	Name		name = PG_GETARG_NAME(0);
 	Name		plugin = PG_GETARG_NAME(1);
+	bool		failover = PG_GETARG_BOOL(2);
 
 	LogicalDecodingContext *ctx = NULL;
 
@@ -120,7 +123,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	 * errors during initialization because it'll get dropped if this
 	 * transaction fails. We'll make it persistent at the end.
 	 */
-	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL, false);
+	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL, failover);
 
 	/*
 	 * Create logical decoding context, to build the initial snapshot.
@@ -174,7 +177,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS)
 Datum
 pg_get_replication_slots(PG_FUNCTION_ARGS)
 {
-#define PG_GET_REPLICATION_SLOTS_COLS 10
+#define PG_GET_REPLICATION_SLOTS_COLS 11
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -224,6 +227,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		XLogRecPtr	restart_lsn;
 		XLogRecPtr	confirmed_flush_lsn;
 		pid_t		active_pid;
+		bool		failover;
 		Oid			database;
 		NameData	slot_name;
 		NameData	plugin;
@@ -246,6 +250,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 			namecpy(&plugin, &slot->data.plugin);
 
 			active_pid = slot->active_pid;
+			failover = slot->data.failover;
 		}
 		SpinLockRelease(&slot->mutex);
 
@@ -276,6 +281,8 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		else
 			nulls[i++] = true;
 
+		values[i++] = BoolGetDatum(failover);
+
 		if (xmin != InvalidTransactionId)
 			values[i++] = TransactionIdGetDatum(xmin);
 		else
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1583862..efdbfd1 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -792,7 +792,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	if (cmd->kind == REPLICATION_KIND_PHYSICAL)
 	{
-		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT, false);
+		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT, cmd->failover);
 	}
 	else
 	{
@@ -803,7 +803,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * handle errors during initialization because it'll get dropped if
 		 * this transaction fails. We'll make it persistent at the end.
 		 */
-		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL, false);
+		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL, cmd->failover);
 	}
 
 	initStringInfo(&output_message);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 1c0ef9a..d14ff7a 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5064,13 +5064,13 @@ DATA(insert OID = 3473 (  spg_range_quad_leaf_consistent	PGNSP PGUID 12 1 0 0 0
 DESCR("SP-GiST support for quad tree over range");
 
 /* replication slots */
-DATA(insert OID = 3779 (  pg_create_physical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 2 0 2249 "19 16" "{19,16,19,3220}" "{i,i,o,o}" "{slot_name,immediately_reserve,slot_name,xlog_position}" _null_ _null_ pg_create_physical_replication_slot _null_ _null_ _null_ ));
+DATA(insert OID = 3779 (  pg_create_physical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 16 16" "{19,16,16,19,3220}" "{i,i,i,o,o}" "{slot_name,immediately_reserve,failover,slot_name,xlog_position}" _null_ _null_ pg_create_physical_replication_slot _null_ _null_ _null_ ));
 DESCR("create a physical replication slot");
 DATA(insert OID = 3780 (  pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 0 2278 "19" _null_ _null_ _null_ _null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ ));
 DESCR("drop a replication slot");
-DATA(insert OID = 3781 (  pg_get_replication_slots	PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,23,28,28,3220,3220}" "{o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null_ _null_ pg_get_replication_slots _null_ _null_ _null_ ));
+DATA(insert OID = 3781 (  pg_get_replication_slots	PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,23,16,28,28,3220,3220}" "{o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,active,active_pid,failover,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null_ _null_ pg_get_replication_slots _null_ _null_ _null_ ));
 DESCR("information about replication slots currently in use");
-DATA(insert OID = 3786 (  pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 2 0 2249 "19 19" "{19,19,25,3220}" "{i,i,o,o}" "{slot_name,plugin,slot_name,xlog_position}" _null_ _null_ pg_create_logical_replication_slot _null_ _null_ _null_ ));
+DATA(insert OID = 3786 (  pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 19 16" "{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,failover,slot_name,xlog_position}" _null_ _null_ pg_create_logical_replication_slot _null_ _null_ _null_ ));
 DESCR("set up a logical replication slot");
 DATA(insert OID = 3782 (  pg_logical_slot_get_changes PGNSP PGUID 12 1000 1000 25 0 f f f f f t v u 4 0 2249 "19 3220 23 1009" "{19,3220,23,1009,3220,28,25}" "{i,i,i,v,o,o,o}" "{slot_name,upto_lsn,upto_nchanges,options,location,xid,data}" _null_ _null_ pg_logical_slot_get_changes _null_ _null_ _null_ ));
 DESCR("get changes from replication slot");
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index d2f1edb..a8fa9d5 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		reserve_wal;
+	bool		failover;
 } CreateReplicationSlotCmd;
 
 
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index cdcbd37..9e23a29 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -4,6 +4,7 @@
  *
  * Copyright (c) 2012-2016, PostgreSQL Global Development Group
  *
+ * src/include/replication/slot.h
  *-------------------------------------------------------------------------
  */
 #ifndef SLOT_H
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2bdba2d..f5dd4a8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1414,11 +1414,12 @@ pg_replication_slots| SELECT l.slot_name,
     d.datname AS database,
     l.active,
     l.active_pid,
+    l.failover,
     l.xmin,
     l.catalog_xmin,
     l.restart_lsn,
     l.confirmed_flush_lsn
-   FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, active, active_pid, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn)
+   FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, active, active_pid, failover, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn)
      LEFT JOIN pg_database d ON ((l.datoid = d.oid)));
 pg_roles| SELECT pg_authid.rolname,
     pg_authid.rolsuper,
-- 
2.1.0

failover-slot-test (application/octet-stream)
#15Petr Jelinek
petr@2ndquadrant.com
In reply to: Craig Ringer (#14)
Re: WIP: Failover Slots

Hi,

here is my code level review:

0001:
This one looks ok except for broken indentation in the new notes in
xlogreader.c and .h. It's maybe slightly overdocumented but given the
complicated way the timeline reading works it's probably warranted.

0002:
+                /*
+                 * No way to wait for the process since it's not a child
+                 * of ours and there's no latch to set, so poll.
+                 *
+                 * We're checking this without any locks held, but
+                 * we'll recheck when we attempt to drop the slot.
+                 */
+                while (slot->in_use && slot->active_pid == active_pid
+                        && max_sleep_micros > 0)
+                {
+                    usleep(micros_per_sleep);
+                    max_sleep_micros -= micros_per_sleep;
+                }

Not sure I buy this, what about postmaster crashes and fast shutdown
requests etc. Also you do usleep for 10s which is quite long. I'd
prefer the classic wait for latch with timeout and pg crash check
here. And even if we go with usleep, then it should be 1s not 10s and
pg_usleep instead of usleep.

0003:
There is a lot of documentation improvements here that are not related
to failover slots or timeline following, it might be a good idea to
split those into separate patch as they are separately useful IMHO.

Other than that it looks good to me.

About other things discussed in this thread. Yes it makes sense in
certain situations to handle this outside of WAL and that does require
notions of nodes, etc. That being said, the timeline following is
needed even if this is handled outside of WAL. And once timeline
following is in, the slots can be handled by the replication solution
itself which is good. But I think the failover slots are still a good
thing to have - it provides HA solution for anything that uses slots,
and that includes physical replication as well. If the specific
logical replication solution wants to handle it for some reason itself
outside of WAL, it can create non-failover slot so in my opinion we
ideally need both types of slots (and that's exactly what this patch
gives us).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Craig Ringer
craig@2ndquadrant.com
In reply to: Petr Jelinek (#15)
3 attachment(s)
Re: WIP: Failover Slots

On 16 February 2016 at 01:23, Petr Jelinek <petr@2ndquadrant.com> wrote:

Hi,

here is my code level review:

0001:
This one looks ok except for broken indentation in the new notes in
xlogreader.c and .h.

I don't see the broken indentation. Not sure what you mean.

+                while (slot->in_use && slot->active_pid == active_pid
+                        && max_sleep_micros > 0)
+                {
+                    usleep(micros_per_sleep);
+                    max_sleep_micros -= micros_per_sleep;
+                }

Not sure I buy this, what about postmaster crashes and fast shutdown
requests etc.

Yeah. I was thinking - incorrectly - that I couldn't use a latch during
recovery.

Revision attached. There was a file missing from the patch too.

0003:
There is a lot of documentation improvements here that are not related
to failover slots or timeline following, it might be a good idea to
split those into separate patch as they are separately useful IMHO.

Yeah, probably worth doing. We'll see how this patch goes.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Allow-logical-slots-to-follow-timeline-switches.patch (text/x-patch)
From 5f41d6a6694ca4e5f335d4cb9e4e8f4f73896015 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 11 Feb 2016 10:44:14 +0800
Subject: [PATCH 1/4] Allow logical slots to follow timeline switches

Make logical replication slots timeline-aware, so replay can
continue from a historical timeline onto the server's current
timeline.

This is required to make failover slots possible and may also
be used by extensions that CreateReplicationSlot on a standby
and replay from that slot once the replica is promoted.

This does NOT add support for replaying from a logical slot on
a standby or for syncing slots to replicas.
---
 src/backend/access/transam/xlogreader.c        |  43 ++++-
 src/backend/access/transam/xlogutils.c         | 214 +++++++++++++++++++++++--
 src/backend/replication/logical/logicalfuncs.c |  38 ++++-
 src/include/access/xlogreader.h                |  33 +++-
 src/include/access/xlogutils.h                 |   2 +
 5 files changed, 295 insertions(+), 35 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fcb0872..5899f44 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -10,6 +10,9 @@
  *
  * NOTES
  *		See xlogreader.h for more notes on this facility.
+ *
+ * 		The xlogreader is compiled as both front-end and backend code so
+ * 		it may not use elog, server-defined static variables, etc.
  *-------------------------------------------------------------------------
  */
 
@@ -116,6 +119,9 @@ XLogReaderAllocate(XLogPageReadCB pagereadfunc, void *private_data)
 		return NULL;
 	}
 
+	/* Will be loaded on first read */
+	state->timelineHistory = NULL;
+
 	return state;
 }
 
@@ -135,6 +141,13 @@ XLogReaderFree(XLogReaderState *state)
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
+#ifdef FRONTEND
+	/* FE code doesn't use this and we can't list_free_deep on FE */
+	Assert(state->timelineHistory == NULL);
+#else
+	if (state->timelineHistory)
+		list_free_deep(state->timelineHistory);
+#endif
 	pfree(state->readBuf);
 	pfree(state);
 }
@@ -208,9 +221,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 
 	if (RecPtr == InvalidXLogRecPtr)
 	{
+		/* No explicit start point, read the record after the one we just read */
 		RecPtr = state->EndRecPtr;
 
 		if (state->ReadRecPtr == InvalidXLogRecPtr)
+			/* allow readPageTLI to go backward */
 			randAccess = true;
 
 		/*
@@ -223,6 +238,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	else
 	{
 		/*
+		 * Caller supplied a position to start at.
+		 *
 		 * In this case, the passed-in record pointer should already be
 		 * pointing to a valid record starting position.
 		 */
@@ -309,8 +326,9 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		/* XXX: more validation should be done here */
 		if (total_len < SizeOfXLogRecord)
 		{
-			report_invalid_record(state, "invalid record length at %X/%X",
-								  (uint32) (RecPtr >> 32), (uint32) RecPtr);
+			report_invalid_record(state, "invalid record length at %X/%X: wanted %lu, got %u",
+								  (uint32) (RecPtr >> 32), (uint32) RecPtr,
+								  SizeOfXLogRecord, total_len);
 			goto err;
 		}
 		gotheader = false;
@@ -466,9 +484,7 @@ err:
 	 * Invalidate the xlog page we've cached. We might read from a different
 	 * source after failure.
 	 */
-	state->readSegNo = 0;
-	state->readOff = 0;
-	state->readLen = 0;
+	XLogReaderInvalCache(state);
 
 	if (state->errormsg_buf[0] != '\0')
 		*errormsg = state->errormsg_buf;
@@ -599,9 +615,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 {
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X",
-							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
+		report_invalid_record(state, "invalid record length at %X/%X: wanted %lu, got %u",
+							  (uint32) (RecPtr >> 32), (uint32) RecPtr,
+							  SizeOfXLogRecord, record->xl_tot_len);
 		return false;
 	}
 	if (record->xl_rmid > RM_MAX_ID)
@@ -1337,3 +1353,14 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 
 	return true;
 }
+
+/*
+ * Invalidate the xlog reader's cached page to force a re-read
+ */
+void
+XLogReaderInvalCache(XLogReaderState *state)
+{
+	state->readSegNo = 0;
+	state->readOff = 0;
+	state->readLen = 0;
+}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 444e218..85bac01 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -7,6 +7,9 @@
  * This file contains support routines that are used by XLOG replay functions.
  * None of this code is used during normal system operation.
  *
+ * Unlike xlogreader.c this is only compiled for the backend so it may use
+ * elog, etc.
+ *
  *
  * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -21,6 +24,7 @@
 
 #include "miscadmin.h"
 
+#include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
@@ -651,6 +655,8 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 	static int	sendFile = -1;
 	static XLogSegNo sendSegNo = 0;
 	static uint32 sendOff = 0;
+	/* So we notice if asked for the same seg on a new tli: */
+	static TimeLineID lastTLI = 0;
 
 	p = buf;
 	recptr = startptr;
@@ -664,11 +670,11 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 
 		startoff = recptr % XLogSegSize;
 
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+		/* Do we need to switch to a new xlog segment? */
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) || lastTLI != tli)
 		{
 			char		path[MAXPGPATH];
 
-			/* Switch to another logfile segment */
 			if (sendFile >= 0)
 				close(sendFile);
 
@@ -692,6 +698,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 									path)));
 			}
 			sendOff = 0;
+			lastTLI = tli;
 		}
 
 		/* Need to seek in the file? */
@@ -759,28 +766,66 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 
 	loc = targetPagePtr + reqLen;
+
+	/* Make sure enough xlog is available... */
 	while (1)
 	{
 		/*
-		 * TODO: we're going to have to do something more intelligent about
-		 * timelines on standbys. Use readTimeLineHistory() and
-		 * tliOfPointInHistory() to get the proper LSN? For now we'll catch
-		 * that case earlier, but the code and TODO is left in here for when
-		 * that changes.
+		 * Check which timeline to get the record from.
+		 *
+		 * We have to do it after each loop because if we're in
+		 * recovery as a cascading standby the current timeline
+		 * might've become historical.
 		 */
-		if (!RecoveryInProgress())
+		XLogReadDetermineTimeline(state);
+
+		if (state->currTLI == ThisTimeLineID)
 		{
-			*pageTLI = ThisTimeLineID;
-			flushptr = GetFlushRecPtr();
+			/*
+			 * We're reading from the current timeline so we might
+			 * have to wait for the desired record to be generated
+			 * (or, for a standby, received & replayed)
+			 */
+			if (!RecoveryInProgress())
+			{
+				*pageTLI = ThisTimeLineID;
+				flushptr = GetFlushRecPtr();
+			}
+			else
+				flushptr = GetXLogReplayRecPtr(pageTLI);
+
+			if (loc <= flushptr)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(1000L);
 		}
 		else
-			flushptr = GetXLogReplayRecPtr(pageTLI);
-
-		if (loc <= flushptr)
+		{
+			/*
+			 * We're on a historical timeline, limit reading to the
+			 * switch point where we moved to the next timeline.
+			 *
+			 * We could just jump to the next timeline early since
+			 * the whole segment the last page is on got copied onto
+			 * the new timeline, but this is simpler.
+			 */
+			flushptr = state->currTLIValidUntil;
+
+			/*
+			 * FIXME: Setting pageTLI to the TLI the *record* we
+			 * want is on can be slightly wrong; the page might
+			 * begin on an older timeline if it contains a timeline
+			 * switch, since its xlog segment will've been copied
+			 * from the prior timeline. We should really read the
+			 * page header. It's pretty harmless though as nothing
+			 * cares so long as the timeline doesn't go backwards.
+			 */
+			*pageTLI = state->currTLI;
+
+			/* No need to wait on a historical timeline */
 			break;
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(1000L);
+		}
 	}
 
 	/* more than one block available */
@@ -793,7 +838,142 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	else
 		count = flushptr - targetPagePtr;
 
-	XLogRead(cur_page, *pageTLI, targetPagePtr, XLOG_BLCKSZ);
+	XLogRead(cur_page, *pageTLI, targetPagePtr, count);
 
 	return count;
 }
+
+/*
+ * Figure out what timeline to look on for the record the xlogreader
+ * is being asked to read, in currRecPtr. This may be used
+ * to determine which xlog segment file to open, etc.
+ *
+ * It depends on:
+ *
+ * - Whether we're reading a record immediately following one we read
+ *   before or doing a random read. We can only use the cached
+ *   timeline info if we're reading sequentially.
+ *
+ * - Whether the timeline of the prior record read was historical or
+ *   the current timeline and, if historical, on where it's valid up
+ *   to. On a historical timeline we need to avoid reading past the
+ *   timeline switch point. The records after it are probably invalid,
+ *   but worse, they might be valid but *different*.
+ *
+ * - If the current timeline became historical since the last record
+ *   we read. We need to make sure we don't read past the switch
+ *   point.
+ *
+ * None of this has any effect unless callbacks use currTLI to
+ * determine which timeline to read from and optionally use the
+ * validity limit to avoid reading past the valid end of a page.
+ *
+ * Note that an xlog segment may contain data from an older timeline
+ * if it was copied during a timeline switch. Callers may NOT assume
+ * that currTLI is the timeline that will be in a given page's
+ * xlp_tli; the page may begin on an older timeline.
+ */
+void
+XLogReadDetermineTimeline(XLogReaderState *state)
+{
+	if (state->timelineHistory == NULL)
+		state->timelineHistory = readTimeLineHistory(ThisTimeLineID);
+
+	if (state->currTLIValidUntil == InvalidXLogRecPtr &&
+		state->currTLI != ThisTimeLineID &&
+		state->currTLI != 0)
+	{
+		/*
+		 * We were reading what was the current timeline but it became
+		 * historical. Either we were replaying as a replica and got
+		 * promoted or we're replaying as a cascading replica from a
+		 * parent that got promoted.
+		 *
+		 * Force a re-read of the timeline history.
+		 */
+		list_free_deep(state->timelineHistory);
+		state->timelineHistory = readTimeLineHistory(ThisTimeLineID);
+
+		elog(DEBUG2, "timeline %u became historical during decoding",
+				state->currTLI);
+
+		/* then invalidate the timeline info so we read again */
+		state->currTLI = 0;
+	}
+
+	if (state->currRecPtr == state->EndRecPtr &&
+		state->currTLIValidUntil != InvalidXLogRecPtr &&
+		state->currRecPtr >= state->currTLIValidUntil)
+	{
+		/*
+		 * We're reading the immediately following record but we're at
+		 * a timeline boundary and must read the next record from the
+		 * new TLI.
+		 */
+		elog(DEBUG2, "Requested record %X/%X is after end of cur TLI %u "
+				"valid until %X/%X, switching to next timeline",
+				(uint32)(state->currRecPtr >> 32),
+				(uint32)state->currRecPtr,
+				state->currTLI,
+				(uint32)(state->currTLIValidUntil >> 32),
+				(uint32)(state->currTLIValidUntil));
+
+		/* Invalidate TLI info so we look it up again */
+		state->currTLI = 0;
+		state->currTLIValidUntil = InvalidXLogRecPtr;
+	}
+
+	if (state->currRecPtr != state->EndRecPtr ||
+		state->currTLI == 0)
+	{
+		/*
+		 * Something changed. We're not reading the record immediately
+		 * after the one we just read, the previous record was at
+		 * timeline boundary or we didn't yet determine the timeline
+		 * to read from.
+		 *
+		 * Work out what timeline to read this record from.
+		 */
+		state->currTLI = tliOfPointInHistory(state->currRecPtr,
+				state->timelineHistory);
+
+		if (state->currTLI != ThisTimeLineID)
+		{
+			/*
+			 * It's on a historical timeline.
+			 *
+			 * We'll probably read more records after this so make a
+			 * note of the point at we have to stop reading and do
+			 * another TLI switch.
+			 *
+			 * Callbacks can also use this to avoid reading past the
+			 * valid end of the TLI.
+			 */
+			state->currTLIValidUntil = tliSwitchPoint(state->currTLI,
+					state->timelineHistory, NULL);
+		}
+		else
+		{
+			/*
+			 * We're on the current timeline. The callback can use the
+			 * xlog flush position and we don't have to worry about
+			 * the TLI ending.
+			 *
+			 * If we're in recovery from another standby (cascading)
+			 * we could receive a new timeline, making the current
+			 * timeline historical. We check that by comparing currTLI
+			 * again at each record read.
+			 */
+			state->currTLIValidUntil = InvalidXLogRecPtr;
+		}
+
+		elog(DEBUG2, "XLog read ptr %X/%X is on tli %u valid until %X/%X, current tli is %u",
+				(uint32)(state->currRecPtr >> 32),
+				(uint32)state->currRecPtr,
+				state->currTLI,
+				(uint32)(state->currTLIValidUntil >> 32),
+				(uint32)(state->currTLIValidUntil),
+				ThisTimeLineID);
+	}
+}
+
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index f789fc1..f29fca3 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -231,12 +231,6 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	rsinfo->setResult = p->tupstore;
 	rsinfo->setDesc = p->tupdesc;
 
-	/* compute the current end-of-wal */
-	if (!RecoveryInProgress())
-		end_of_wal = GetFlushRecPtr();
-	else
-		end_of_wal = GetXLogReplayRecPtr(NULL);
-
 	ReplicationSlotAcquire(NameStr(*name));
 
 	PG_TRY();
@@ -263,6 +257,14 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 
 		ctx->output_writer_private = p;
 
+		/*
+		 * We start reading xlog from the restart lsn, even though in
+		 * CreateDecodingContext we set the snapshot builder up using the
+		 * slot's candidate_restart_lsn. This means we might read xlog we don't
+		 * actually decode rows from, but the snapshot builder might need it to
+		 * get to a consistent point. The point we start returning data to
+		 * *users* at is the candidate restart lsn from the decoding context.
+		 */
 		startptr = MyReplicationSlot->data.restart_lsn;
 
 		CurrentResourceOwner = ResourceOwnerCreate(CurrentResourceOwner, "logical decoding");
@@ -270,8 +272,14 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		/* invalidate non-timetravel entries */
 		InvalidateSystemCaches();
 
+		if (!RecoveryInProgress())
+			end_of_wal = GetFlushRecPtr();
+		else
+			end_of_wal = GetXLogReplayRecPtr(NULL);
+
+		/* Decode until we run out of records */
 		while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) ||
-			 (ctx->reader->EndRecPtr && ctx->reader->EndRecPtr < end_of_wal))
+			 (ctx->reader->EndRecPtr != InvalidXLogRecPtr && ctx->reader->EndRecPtr < end_of_wal))
 		{
 			XLogRecord *record;
 			char	   *errm = NULL;
@@ -280,6 +288,10 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 			if (errm)
 				elog(ERROR, "%s", errm);
 
+			/*
+			 * Now that we've set up the xlog reader state subsequent calls
+			 * pass InvalidXLogRecPtr to say "continue from last record"
+			 */
 			startptr = InvalidXLogRecPtr;
 
 			/*
@@ -299,6 +311,18 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 			CHECK_FOR_INTERRUPTS();
 		}
 
+		/* Make sure timeline lookups use the start of the next record */
+		startptr = ctx->reader->EndRecPtr;
+
+		/*
+		 * The XLogReader will read a page past the valid end of WAL
+		 * because it doesn't know about timelines. When we switch
+		 * timelines and ask it for the first page on the new timeline it
+		 * will think it has it cached, but it'll have the old partial
+		 * page and say it can't find the next record. So flush the cache.
+		 */
+		XLogReaderInvalCache(ctx->reader);
+
 		tuplestore_donestoring(tupstore);
 
 		CurrentResourceOwner = old_resowner;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 7553cc4..4ccee95 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -20,12 +20,16 @@
  *		with the XLogRec* macros and functions. You can also decode a
  *		record that's already constructed in memory, without reading from
  *		disk, by calling the DecodeXLogRecord() function.
+ *
+ * 		The xlogreader is compiled as both front-end and backend code so
+ * 		it may not use elog, server-defined static variables, etc.
  *-------------------------------------------------------------------------
  */
 #ifndef XLOGREADER_H
 #define XLOGREADER_H
 
 #include "access/xlogrecord.h"
+#include "nodes/pg_list.h"
 
 typedef struct XLogReaderState XLogReaderState;
 
@@ -139,26 +143,46 @@ struct XLogReaderState
 	 * ----------------------------------------
 	 */
 
-	/* Buffer for currently read page (XLOG_BLCKSZ bytes) */
+	/*
+	 * Buffer for currently read page (XLOG_BLCKSZ bytes, valid up to
+	 * at least readLen bytes)
+	 */
 	char	   *readBuf;
 
-	/* last read segment, segment offset, read length, TLI */
+	/*
+	 * last read segment, segment offset, read length, TLI for
+	 * data currently in readBuf.
+	 */
 	XLogSegNo	readSegNo;
 	uint32		readOff;
 	uint32		readLen;
 	TimeLineID	readPageTLI;
 
-	/* beginning of last page read, and its TLI  */
+	/*
+	 * beginning of prior page read, and its TLI. Doesn't
+	 * necessarily correspond to what's in readBuf, used for
+	 * timeline sanity checks.
+	 */
 	XLogRecPtr	latestPagePtr;
 	TimeLineID	latestPageTLI;
 
 	/* beginning of the WAL record being read. */
 	XLogRecPtr	currRecPtr;
+	/* timeline to read it from, 0 if a lookup is required */
+	TimeLineID  currTLI;
+	/*
+	 * Endpoint of timeline in currTLI if it's historical or
+	 * InvalidXLogRecPtr if currTLI is the current timeline.
+	 */
+	XLogRecPtr	currTLIValidUntil;
 
 	/* Buffer for current ReadRecord result (expandable) */
 	char	   *readRecordBuf;
 	uint32		readRecordBufSize;
 
+	/* cached timeline history */
+	List	   *timelineHistory;
+
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
 };
@@ -174,6 +198,9 @@ extern void XLogReaderFree(XLogReaderState *state);
 extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
 			   XLogRecPtr recptr, char **errormsg);
 
+/* Flush any cached page */
+extern void XLogReaderInvalCache(XLogReaderState *state);
+
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif   /* FRONTEND */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 1b9abce..86df8cf 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -50,4 +50,6 @@ extern void FreeFakeRelcacheEntry(Relation fakerel);
 extern int read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI);
 
+extern void XLogReadDetermineTimeline(XLogReaderState *state);
+
 #endif
-- 
2.1.0

0002-Allow-replication-slots-to-follow-failover.patch (text/x-patch; charset=US-ASCII)
From 896c23024138299193f25e59645a74bd1ac8c5b2 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 15 Feb 2016 11:56:13 +0800
Subject: [PATCH 2/4] Allow replication slots to follow failover

Originally replication slots were unique to a single node and weren't
recorded in WAL or replicated. A logical decoding client couldn't follow
a physical standby failover and promotion because the promoted replica
didn't have the original master's slots. The replica may not have
retained all required WAL and there was no way to create a new logical
slot and rewind it back to the point the logical client had replayed to.

Failover slots lift this limitation by replicating slots consistently to
physical standbys, keeping them up to date and using them in WAL
retention calculations. This allows a logical decoding client to follow
a physical failover and promotion without losing its place in the change
stream.

A failover slot may only be created on a master server, as it must be
able to write WAL. This limitation may be lifted later.

This patch adds a new backup label entry 'MIN FAILOVER SLOT LSN' that,
if present, indicates the minimum LSN needed by any failover slot that
is present in the base backup. Backup tools should check for this entry
and ensure they retain all xlogs including and after that point. It also
changes the return value of pg_start_backup(), the BASE_BACKUP walsender
command, etc to report the minimum WAL required by any failover slot
if this is a lower LSN than the redo position so that base backups
contain the WAL required for slots to work.

pg_basebackup is also modified to copy the contents of pg_replslot.
Non-failover slots will now be removed during backend startup instead
of being omitted from the copy.

This patch does not add any user interface for failover slots. There's
no way to create them from SQL or from the walsender. That and the
documentation for failover slots are in the next patch in the series
so that this patch is entirely focused on the implementation.

Craig Ringer, based on a prototype by Simon Riggs
---
 src/backend/access/rmgrdesc/Makefile       |   2 +-
 src/backend/access/rmgrdesc/replslotdesc.c |  65 ++++
 src/backend/access/transam/rmgr.c          |   1 +
 src/backend/access/transam/xlog.c          |  45 ++-
 src/backend/commands/dbcommands.c          |   3 +
 src/backend/replication/basebackup.c       |  12 -
 src/backend/replication/logical/decode.c   |   1 +
 src/backend/replication/logical/logical.c  |  25 +-
 src/backend/replication/slot.c             | 591 +++++++++++++++++++++++++++--
 src/backend/replication/slotfuncs.c        |   4 +-
 src/backend/replication/walsender.c        |   8 +-
 src/bin/pg_xlogdump/replslotdesc.c         |   1 +
 src/bin/pg_xlogdump/rmgrdesc.c             |   1 +
 src/include/access/rmgrlist.h              |   1 +
 src/include/replication/slot.h             |  69 +---
 src/include/replication/slot_xlog.h        | 100 +++++
 16 files changed, 798 insertions(+), 131 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/replslotdesc.c
 create mode 120000 src/bin/pg_xlogdump/replslotdesc.c
 create mode 100644 src/include/replication/slot_xlog.h

diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..600b544 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -10,7 +10,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
 	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
-	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
+	   replorigindesc.o replslotdesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/replslotdesc.c b/src/backend/access/rmgrdesc/replslotdesc.c
new file mode 100644
index 0000000..5829e8d
--- /dev/null
+++ b/src/backend/access/rmgrdesc/replslotdesc.c
@@ -0,0 +1,65 @@
+/*-------------------------------------------------------------------------
+ *
+ * replslotdesc.c
+ *	  rmgr descriptor routines for replication/slot.c
+ *
+ * Portions Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/replslotdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "replication/slot_xlog.h"
+
+void
+replslot_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			{
+				ReplicationSlotInWAL xlrec;
+
+				xlrec = (ReplicationSlotInWAL) rec;
+
+				appendStringInfo(buf, "of slot %s with restart %X/%X and xid %u confirmed to %X/%X",
+						NameStr(xlrec->name),
+						(uint32)(xlrec->restart_lsn>>32), (uint32)(xlrec->restart_lsn),
+						xlrec->xmin,
+						(uint32)(xlrec->confirmed_flush>>32), (uint32)(xlrec->confirmed_flush));
+
+				break;
+			}
+		case XLOG_REPLSLOT_DROP:
+			{
+				xl_replslot_drop *xlrec;
+
+				xlrec = (xl_replslot_drop *) rec;
+
+				appendStringInfo(buf, "of slot %s", NameStr(xlrec->name));
+
+				break;
+			}
+	}
+}
+
+const char *
+replslot_identify(uint8 info)
+{
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			return "UPDATE";
+		case XLOG_REPLSLOT_DROP:
+			return "DROP";
+		default:
+			return NULL;
+	}
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7c4d773..0bd5796 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -24,6 +24,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 94b79ac..80d0aa5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6366,8 +6366,11 @@ StartupXLOG(void)
 	/*
 	 * Initialize replication slots, before there's a chance to remove
 	 * required resources.
+	 *
+	 * If we're in archive recovery then non-failover slots are no
+	 * longer of any use and should be dropped during startup.
 	 */
-	StartupReplicationSlots();
+	StartupReplicationSlots(ArchiveRecoveryRequested);
 
 	/*
 	 * Startup logical state, needs to be setup now so we have proper data
@@ -9794,6 +9797,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	bool		backup_started_in_recovery = false;
 	XLogRecPtr	checkpointloc;
 	XLogRecPtr	startpoint;
+	XLogRecPtr  slot_startpoint;
 	TimeLineID	starttli;
 	pg_time_t	stamp_time;
 	char		strfbuf[128];
@@ -9940,6 +9944,16 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 			checkpointfpw = ControlFile->checkPointCopy.fullPageWrites;
 			LWLockRelease(ControlFileLock);
 
+			/*
+			 * If failover slots are in use we must retain and transfer WAL
+			 * older than the redo location so that those slots can be replayed
+			 * from after a failover event.
+			 *
+			 * This MUST be at an xlog segment boundary so truncate the LSN
+			 * appropriately.
+			 */
+			slot_startpoint = (ReplicationSlotsComputeRequiredLSN(true)/ XLOG_SEG_SIZE) * XLOG_SEG_SIZE;
+
 			if (backup_started_in_recovery)
 			{
 				XLogRecPtr	recptr;
@@ -10108,6 +10122,10 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 						 backup_started_in_recovery ? "standby" : "master");
 		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
 		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
+		if (slot_startpoint != InvalidXLogRecPtr)
+			appendStringInfo(&labelfbuf,  "MIN FAILOVER SLOT LSN: %X/%X\n",
+						(uint32)(slot_startpoint>>32), (uint32)slot_startpoint);
+
 
 		/*
 		 * Okay, write the file, or return its contents to caller.
@@ -10201,10 +10219,33 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 
 	/*
 	 * We're done.  As a convenience, return the starting WAL location.
+	 *
+	 * pg_basebackup etc expect to use this as the position to start copying
+	 * WAL from, so we should return the minimum of the slot start LSN and the
+	 * current redo position to make sure we get all WAL required by failover
+	 * slots.
+	 *
+	 * The min required LSN for failover slots is also available from the
+	 * 'MIN FAILOVER SLOT LSN' entry in the backup label file.
 	 */
+	if (slot_startpoint < startpoint)
+	{
+		List *history;
+		TimeLineID slot_start_tli;
+
+		/* Min LSN required by a slot may be on an older timeline. */
+		history = readTimeLineHistory(ThisTimeLineID);
+		slot_start_tli = tliOfPointInHistory(slot_startpoint, history);
+		list_free_deep(history);
+
+		if (slot_start_tli < starttli)
+			starttli = slot_start_tli;
+	}
+
 	if (starttli_p)
 		*starttli_p = starttli;
-	return startpoint;
+
+	return slot_startpoint < startpoint ? slot_startpoint : startpoint;
 }
 
 /* Error cleanup callback for pg_start_backup */
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c1c0223..61fc45b 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,9 @@ dbase_redo(XLogReaderState *record)
 		/* Clean out the xlog relcache too */
 		XLogDropDatabase(xlrec->db_id);
 
+		/* Drop any logical failover slots for this database */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* And remove the physical files */
 		if (!rmtree(dst_path, true))
 			ereport(WARNING,
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index af0fb09..ab1f271 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -973,18 +973,6 @@ sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces,
 		}
 
 		/*
-		 * Skip pg_replslot, not useful to copy. But include it as an empty
-		 * directory anyway, so we get permissions right.
-		 */
-		if (strcmp(de->d_name, "pg_replslot") == 0)
-		{
-			if (!sizeonly)
-				_tarWriteHeader(pathbuf + basepathlen + 1, NULL, &statbuf);
-			size += 512;		/* Size of the header just added */
-			continue;
-		}
-
-		/*
 		 * We can skip pg_xlog, the WAL segments need to be fetched from the
 		 * WAL archive anyway. But include it as an empty directory anyway, so
 		 * we get permissions right.
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 88c3a49..76fc5c7 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -135,6 +135,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
+		case RM_REPLSLOT_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2e6d3f9..4feb2ca 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -85,16 +85,19 @@ CheckLogicalDecodingRequirements(void)
 				 errmsg("logical decoding requires a database connection")));
 
 	/* ----
-	 * TODO: We got to change that someday soon...
+	 * TODO: Allow logical decoding from a standby
 	 *
-	 * There's basically three things missing to allow this:
+	 * There are some things missing to allow this:
 	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
+	 *    LSN belongs to
+	 * 2) To prevent needed rows from being removed we would need
+	 *    to enhance hot_standby_feedback so it sends both xmin and
+	 *    catalog_xmin to the master.  A standby slot can't write WAL, so we
+	 *    wouldn't be able to use it directly for failover, without some very
+	 *    complex state interactions via master.
+	 *
+	 * So this doesn't seem likely to change anytime soon.
+	 *
 	 * ----
 	 */
 	if (RecoveryInProgress())
@@ -272,7 +275,7 @@ CreateInitDecodingContext(char *plugin,
 	slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
 	slot->data.catalog_xmin = slot->effective_catalog_xmin;
 
-	ReplicationSlotsComputeRequiredXmin(true);
+	ReplicationSlotsUpdateRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
 
@@ -908,8 +911,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 			MyReplicationSlot->effective_catalog_xmin = MyReplicationSlot->data.catalog_xmin;
 			SpinLockRelease(&MyReplicationSlot->mutex);
 
-			ReplicationSlotsComputeRequiredXmin(false);
-			ReplicationSlotsComputeRequiredLSN();
+			ReplicationSlotsUpdateRequiredXmin(false);
+			ReplicationSlotsUpdateRequiredLSN();
 		}
 	}
 	else
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index a2c6524..54c997a 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -24,7 +24,18 @@
  * directory. Inside that directory the state file will contain the slot's
  * own data. Additional data can be stored alongside that file if required.
  * While the server is running, the state data is also cached in memory for
- * efficiency.
+ * efficiency. Non-failover slots are NOT subject to WAL logging and may
+ * be used on standbys (though that's only supported for physical slots at
+ * the moment). They use tempfile writes and swaps for crash safety.
+ *
+ * A failover slot created on a master node generates WAL records that
+ * maintain a copy of the slot on standby nodes. If a standby node is
+ * promoted, the failover slot allows access to be restarted just as if
+ * the original master node were being accessed, allowing for the timeline
+ * change. The replica considers slot positions when removing WAL to make
+ * sure it can satisfy the needs of slots after promotion.  For logical
+ * decoding slots the slot's internal state is kept up to date so it's
+ * ready for use after promotion.
  *
  * ReplicationSlotAllocationLock must be taken in exclusive mode to allocate
  * or free a slot. ReplicationSlotControlLock must be taken in shared mode
@@ -44,6 +55,7 @@
 #include "common/string.h"
 #include "miscadmin.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -101,10 +113,14 @@ static LWLockTranche ReplSlotIOLWLockTranche;
 static void ReplicationSlotDropAcquired(void);
 
 /* internal persistency functions */
-static void RestoreSlotFromDisk(const char *name);
+static void RestoreSlotFromDisk(const char *name, bool drop_nonfailover_slots);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
 static void SaveSlotToPath(ReplicationSlot *slot, const char *path, int elevel);
 
+/* internal redo functions */
+static void ReplicationSlotRedoCreateOrUpdate(ReplicationSlotInWAL xlrec);
+static void ReplicationSlotRedoDrop(const char * slotname);
+
 /*
  * Report shared-memory space needed by ReplicationSlotShmemInit.
  */
@@ -220,7 +236,8 @@ ReplicationSlotValidateName(const char *name, int elevel)
  */
 void
 ReplicationSlotCreate(const char *name, bool db_specific,
-					  ReplicationSlotPersistency persistency)
+					  ReplicationSlotPersistency persistency,
+					  bool failover)
 {
 	ReplicationSlot *slot = NULL;
 	int			i;
@@ -273,11 +290,23 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	Assert(!slot->in_use);
 	Assert(slot->active_pid == 0);
 	slot->data.persistency = persistency;
+
+	elog(DEBUG1, "slot persistency is %d", (int) slot->data.persistency);
+
 	slot->data.xmin = InvalidTransactionId;
 	slot->effective_xmin = InvalidTransactionId;
 	StrNCpy(NameStr(slot->data.name), name, NAMEDATALEN);
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.restart_lsn = InvalidXLogRecPtr;
+	/* Slot timeline is unused and always zero */
+	slot->data.restart_tli = 0;
+
+	if (failover && RecoveryInProgress())
+		ereport(ERROR,
+				(errmsg("a failover slot may not be created on a replica"),
+				 errhint("Create the slot on the master server instead.")));
+
+	slot->data.failover = failover;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -313,6 +342,10 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 
 /*
  * Find a previously created slot and mark it as used by this backend.
+ *
+ * Sets active_pid and assigns MyReplicationSlot iff successfully acquired.
+ *
+ * ERRORs on an attempt to acquire a failover slot when in recovery.
  */
 void
 ReplicationSlotAcquire(const char *name)
@@ -335,7 +368,11 @@ ReplicationSlotAcquire(const char *name)
 		{
 			SpinLockAcquire(&s->mutex);
 			active_pid = s->active_pid;
-			if (active_pid == 0)
+			/*
+			 * We can only claim a slot for our use if it's not claimed
+			 * by someone else AND it isn't a failover slot on a standby.
+			 */
+			if (active_pid == 0 && !(RecoveryInProgress() && s->data.failover))
 				s->active_pid = MyProcPid;
 			SpinLockRelease(&s->mutex);
 			slot = s;
@@ -349,12 +386,24 @@ ReplicationSlotAcquire(const char *name)
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_OBJECT),
 				 errmsg("replication slot \"%s\" does not exist", name)));
+
 	if (active_pid != 0)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_IN_USE),
 			   errmsg("replication slot \"%s\" is already active for PID %d",
 					  name, active_pid)));
 
+	/*
+	 * An attempt to use a failover slot from a standby must fail since
+	 * we can't write WAL from a standby and there's no sensible way
+	 * to advance slot position from both replica and master anyway.
+	 */
+	if (RecoveryInProgress() && slot->data.failover)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_IN_USE),
+				 errmsg("replication slot \"%s\" is reserved for use after failover",
+					  name)));
+
 	/* We made this slot active, so it's ours now. */
 	MyReplicationSlot = slot;
 }
@@ -411,16 +460,24 @@ ReplicationSlotDrop(const char *name)
 /*
  * Permanently drop the currently acquired replication slot which will be
  * released by the point this function returns.
+ *
+ * Callers must NOT hold ReplicationSlotControlLock in SHARED mode.  EXCLUSIVE
+ * is OK, or not held at all.
  */
 static void
-ReplicationSlotDropAcquired(void)
+ReplicationSlotDropAcquired(void)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
 	ReplicationSlot *slot = MyReplicationSlot;
+	bool slot_is_failover;
+	bool took_control_lock = false,
+		 took_allocation_lock = false;
 
 	Assert(MyReplicationSlot != NULL);
 
+	slot_is_failover = slot->data.failover;
+
 	/* slot isn't acquired anymore */
 	MyReplicationSlot = NULL;
 
@@ -428,8 +485,27 @@ ReplicationSlotDropAcquired(void)
 	 * If some other backend ran this code concurrently with us, we might try
 	 * to delete a slot with a certain name while someone else was trying to
 	 * create a slot with the same name.
+	 *
+	 * If called with the lock already held it MUST be held in
+	 * EXCLUSIVE mode.
 	 */
-	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+	if (!LWLockHeldByMe(ReplicationSlotAllocationLock))
+	{
+		took_allocation_lock = true;
+		LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+	}
+
+	/* Record the drop in XLOG if we aren't replaying WAL */
+	if (XLogInsertAllowed() && slot_is_failover)
+	{
+		xl_replslot_drop xlrec;
+
+		memcpy(&(xlrec.name), NameStr(slot->data.name), NAMEDATALEN);
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xlrec));
+		(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_DROP);
+	}
 
 	/* Generate pathnames. */
 	sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
@@ -459,7 +535,11 @@ ReplicationSlotDropAcquired(void)
 	}
 	else
 	{
-		bool		fail_softly = slot->data.persistency == RS_EPHEMERAL;
+		bool		fail_softly = false;
+
+		if (RecoveryInProgress() ||
+			slot->data.persistency == RS_EPHEMERAL)
+			fail_softly = true;
 
 		SpinLockAcquire(&slot->mutex);
 		slot->active_pid = 0;
@@ -477,18 +557,27 @@ ReplicationSlotDropAcquired(void)
 	 * grabbing the mutex because nobody else can be scanning the array here,
 	 * and nobody can be attached to this slot and thus access it without
 	 * scanning the array.
+	 *
+	 * You must hold the lock in EXCLUSIVE mode or not at all.
 	 */
-	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	if (!LWLockHeldByMe(ReplicationSlotControlLock))
+	{
+		took_control_lock = true;
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	}
+
 	slot->active_pid = 0;
 	slot->in_use = false;
-	LWLockRelease(ReplicationSlotControlLock);
+
+	if (took_control_lock)
+		LWLockRelease(ReplicationSlotControlLock);
 
 	/*
 	 * Slot is dead and doesn't prevent resource removal anymore, recompute
 	 * limits.
 	 */
-	ReplicationSlotsComputeRequiredXmin(false);
-	ReplicationSlotsComputeRequiredLSN();
+	ReplicationSlotsUpdateRequiredXmin(false);
+	ReplicationSlotsUpdateRequiredLSN();
 
 	/*
 	 * If removing the directory fails, the worst thing that will happen is
@@ -504,7 +593,8 @@ ReplicationSlotDropAcquired(void)
 	 * We release this at the very end, so that nobody starts trying to create
 	 * a slot while we're still cleaning up the detritus of the old one.
 	 */
-	LWLockRelease(ReplicationSlotAllocationLock);
+	if (took_allocation_lock)
+		LWLockRelease(ReplicationSlotAllocationLock);
 }
 
 /*
@@ -544,6 +634,9 @@ ReplicationSlotMarkDirty(void)
 /*
  * Convert a slot that's marked as RS_EPHEMERAL to a RS_PERSISTENT slot,
  * guaranteeing it will be there after an eventual crash.
+ *
+ * For failover slots this is where the creation xlog record is first
+ * emitted, since the slot is not WAL-logged while still ephemeral.
  */
 void
 ReplicationSlotPersist(void)
@@ -565,7 +658,7 @@ ReplicationSlotPersist(void)
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  */
 void
-ReplicationSlotsComputeRequiredXmin(bool already_locked)
+ReplicationSlotsUpdateRequiredXmin(bool already_locked)
 {
 	int			i;
 	TransactionId agg_xmin = InvalidTransactionId;
@@ -610,10 +703,20 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 }
 
 /*
- * Compute the oldest restart LSN across all slots and inform xlog module.
+ * Update the xlog module's copy of the minimum restart lsn across all slots
  */
 void
-ReplicationSlotsComputeRequiredLSN(void)
+ReplicationSlotsUpdateRequiredLSN(void)
+{
+	XLogSetReplicationSlotMinimumLSN(ReplicationSlotsComputeRequiredLSN(false));
+}
+
+/*
+ * Compute the oldest restart LSN across all slots (or optionally
+ * only failover slots) and return it.
+ */
+XLogRecPtr
+ReplicationSlotsComputeRequiredLSN(bool failover_only)
 {
 	int			i;
 	XLogRecPtr	min_required = InvalidXLogRecPtr;
@@ -625,14 +728,19 @@ ReplicationSlotsComputeRequiredLSN(void)
 	{
 		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
 		XLogRecPtr	restart_lsn;
+		bool		failover;
 
 		if (!s->in_use)
 			continue;
 
 		SpinLockAcquire(&s->mutex);
 		restart_lsn = s->data.restart_lsn;
+		failover = s->data.failover;
 		SpinLockRelease(&s->mutex);
 
+		if (failover_only && !failover)
+			continue;
+
 		if (restart_lsn != InvalidXLogRecPtr &&
 			(min_required == InvalidXLogRecPtr ||
 			 restart_lsn < min_required))
@@ -640,7 +748,7 @@ ReplicationSlotsComputeRequiredLSN(void)
 	}
 	LWLockRelease(ReplicationSlotControlLock);
 
-	XLogSetReplicationSlotMinimumLSN(min_required);
+	return min_required;
 }
 
 /*
@@ -649,7 +757,7 @@ ReplicationSlotsComputeRequiredLSN(void)
  * Returns InvalidXLogRecPtr if logical decoding is disabled or no logical
  * slots exist.
  *
- * NB: this returns a value >= ReplicationSlotsComputeRequiredLSN(), since it
+ * NB: this returns a value >= ReplicationSlotsComputeRequiredLSN(false), since it
  * ignores physical replication slots.
  *
  * The results aren't required frequently, so we don't maintain a precomputed
@@ -747,6 +855,45 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	Assert(MyReplicationSlot == NULL);
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		if (s->in_use && s->data.database == dboid)
+		{
+			/*
+			 * There should be no connections to this dbid
+			 * therefore all slots for this dbid should be
+			 * logical, inactive failover slots.
+			 */
+			Assert(s->active_pid == 0);
+			Assert(s->data.failover);
+			Assert(SlotIsLogical(s));
+
+			/*
+			 * Acquire the replication slot
+			 */
+			MyReplicationSlot = s;
+
+			/*
+			 * No need to deactivate slot, especially since we
+			 * already hold ReplicationSlotControlLock.
+			 */
+			ReplicationSlotDropAcquired();
+		}
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	MyReplicationSlot = NULL;
+}
 
 /*
  * Check whether the server's configuration supports using replication
@@ -779,12 +926,13 @@ ReplicationSlotReserveWal(void)
 
 	Assert(slot != NULL);
 	Assert(slot->data.restart_lsn == InvalidXLogRecPtr);
+	Assert(slot->data.restart_tli == 0);
 
 	/*
 	 * The replication slot mechanism is used to prevent removal of required
 	 * WAL. As there is no interlock between this routine and checkpoints, WAL
 	 * segments could concurrently be removed when a now stale return value of
-	 * ReplicationSlotsComputeRequiredLSN() is used. In the unlikely case that
+	 * ReplicationSlotsComputeRequiredLSN() is used. In the unlikely case that
 	 * this happens we'll just retry.
 	 */
 	while (true)
@@ -821,12 +969,12 @@ ReplicationSlotReserveWal(void)
 		}
 
 		/* prevent WAL removal as fast as possible */
-		ReplicationSlotsComputeRequiredLSN();
+		ReplicationSlotsUpdateRequiredLSN();
 
 		/*
 		 * If all required WAL is still there, great, otherwise retry. The
 		 * slot should prevent further removal of WAL, unless there's a
-		 * concurrent ReplicationSlotsComputeRequiredLSN() after we've written
+		 * concurrent ReplicationSlotsUpdateRequiredLSN() after we've written
 		 * the new restart_lsn above, so normally we should never need to loop
 		 * more than twice.
 		 */
@@ -878,7 +1026,7 @@ CheckPointReplicationSlots(void)
  * needs to be run before we start crash recovery.
  */
 void
-StartupReplicationSlots(void)
+StartupReplicationSlots(bool drop_nonfailover_slots)
 {
 	DIR		   *replication_dir;
 	struct dirent *replication_de;
@@ -917,7 +1065,7 @@ StartupReplicationSlots(void)
 		}
 
 		/* looks like a slot in a normal state, restore */
-		RestoreSlotFromDisk(replication_de->d_name);
+		RestoreSlotFromDisk(replication_de->d_name, drop_nonfailover_slots);
 	}
 	FreeDir(replication_dir);
 
@@ -926,8 +1074,8 @@ StartupReplicationSlots(void)
 		return;
 
 	/* Now that we have recovered all the data, compute replication xmin */
-	ReplicationSlotsComputeRequiredXmin(false);
-	ReplicationSlotsComputeRequiredLSN();
+	ReplicationSlotsUpdateRequiredXmin(false);
+	ReplicationSlotsUpdateRequiredLSN();
 }
 
 /* ----
@@ -996,6 +1144,8 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 
 /*
  * Shared functionality between saving and creating a replication slot.
+ *
+ * For failover slots this is where we emit xlog.
  */
 static void
 SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
@@ -1006,15 +1156,18 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	ReplicationSlotOnDisk cp;
 	bool		was_dirty;
 
-	/* first check whether there's something to write out */
-	SpinLockAcquire(&slot->mutex);
-	was_dirty = slot->dirty;
-	slot->just_dirtied = false;
-	SpinLockRelease(&slot->mutex);
+	if (!RecoveryInProgress())
+	{
+		/* first check whether there's something to write out */
+		SpinLockAcquire(&slot->mutex);
+		was_dirty = slot->dirty;
+		slot->just_dirtied = false;
+		SpinLockRelease(&slot->mutex);
 
-	/* and don't do anything if there's nothing to write */
-	if (!was_dirty)
-		return;
+		/* and don't do anything if there's nothing to write */
+		if (!was_dirty)
+			return;
+	}
 
 	LWLockAcquire(&slot->io_in_progress_lock, LW_EXCLUSIVE);
 
@@ -1047,6 +1200,25 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	SpinLockRelease(&slot->mutex);
 
+	/*
+	 * If needed, record this action in WAL
+	 */
+	if (slot->data.failover &&
+		slot->data.persistency == RS_PERSISTENT &&
+		!RecoveryInProgress())
+	{
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&cp.slotdata), sizeof(ReplicationSlotPersistentData));
+		/*
+		 * Note that slot creation on the downstream is also an "update".
+		 *
+		 * Slots can start off ephemeral and be updated to persistent. We just
+		 * log the update and the downstream creates the new slot if it doesn't
+		 * exist yet.
+		 */
+		(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_UPDATE);
+	}
+
 	COMP_CRC32C(cp.checksum,
 				(char *) (&cp) + SnapBuildOnDiskNotChecksummedSize,
 				SnapBuildOnDiskChecksummedSize);
@@ -1116,7 +1288,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
  * Load a single slot from disk into memory.
  */
 static void
-RestoreSlotFromDisk(const char *name)
+RestoreSlotFromDisk(const char *name, bool drop_nonfailover_slots)
 {
 	ReplicationSlotOnDisk cp;
 	int			i;
@@ -1235,10 +1407,21 @@ RestoreSlotFromDisk(const char *name)
 						path, checksum, cp.checksum)));
 
 	/*
-	 * If we crashed with an ephemeral slot active, don't restore but delete
-	 * it.
+	 * If we crashed with an ephemeral slot active, don't restore but
+	 * delete it.
+	 *
+	 * Similarly, if we're in archive recovery and will be running as
+	 * a standby (when drop_nonfailover_slots is set), non-failover
+	 * slots can't be relied upon. Logical slots might have a catalog
+	 * xmin lower than reality because the original slot on the master
+	 * advanced past the point the stale slot on the replica is stuck
+	 * at. Additionally, slots might have been copied while being
+	 * written to if the basebackup copy method was not atomic.
+	 * Failover slots are safe since they're WAL-logged and follow the
+	 * master's slot position.
 	 */
-	if (cp.slotdata.persistency != RS_PERSISTENT)
+	if (cp.slotdata.persistency != RS_PERSISTENT
+			|| (drop_nonfailover_slots && !cp.slotdata.failover))
 	{
 		sprintf(path, "pg_replslot/%s", name);
 
@@ -1249,6 +1432,14 @@ RestoreSlotFromDisk(const char *name)
 					 errmsg("could not remove directory \"%s\"", path)));
 		}
 		fsync_fname("pg_replslot", true);
+
+		if (cp.slotdata.persistency == RS_PERSISTENT)
+		{
+			ereport(LOG,
+					(errmsg("dropped non-failover slot \"%s\" during archive recovery",
+							 NameStr(cp.slotdata.name))));
+		}
+
 		return;
 	}
 
@@ -1285,5 +1476,331 @@ RestoreSlotFromDisk(const char *name)
 	if (!restored)
 		ereport(PANIC,
 				(errmsg("too many replication slots active before shutdown"),
-				 errhint("Increase max_replication_slots and try again.")));
+				 errhint("Increase max_replication_slots (currently %d) and try again.",
+					 max_replication_slots)));
+}
+
+/*
+ * This usually just writes new persistent data to the slot state, but an
+ * update record might create a new slot on the downstream if we changed a
+ * previously ephemeral slot to persistent. We have to decide which
+ * by looking for the existing slot.
+ */
+static void
+ReplicationSlotRedoCreateOrUpdate(ReplicationSlotInWAL xlrec)
+{
+	ReplicationSlot *slot;
+	bool	found_available = false;
+	bool	found_duplicate = false;
+	int		use_slotid = 0;
+	int		i;
+
+	/*
+	 * We're in redo, but someone could still create a local
+	 * non-failover slot and race with us unless we take the
+	 * allocation lock.
+	 */
+	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Find first unused position in the slots array, but keep on
+		 * scanning in case there's an existing slot with the same
+		 * name.
+		 */
+		if (!slot->in_use && !found_available)
+		{
+			use_slotid = i;
+			found_available = true;
+		}
+
+		/*
+		 * Existing slot with same name? It could be our failover slot
+		 * to update or a non-failover slot with a conflicting name.
+		 */
+		if (slot->in_use && strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) == 0)
+		{
+			use_slotid = i;
+			found_available = true;
+			found_duplicate = true;
+			break;
+		}
+	}
+
+	if (found_duplicate && !slot->data.failover)
+	{
+		/*
+		 * A local non-failover slot exists with the same name as
+		 * the failover slot we're creating.
+		 *
+		 * Clobber the client, drop its slot, and carry on with
+		 * our business.
+		 *
+		 * First we must temporarily release the allocation lock while
+		 * we try to terminate the process that holds the slot, since
+		 * we don't want to hold the LWlock for ages. We'll reacquire
+		 * it later.
+		 */
+		LWLockRelease(ReplicationSlotAllocationLock);
+
+		/* We might race with other clients, so retry-loop */
+		do
+		{
+			int active_pid = slot->active_pid;
+			int max_sleep_millis = 120 * 1000;
+			int millis_per_sleep = 1000;
+
+			if (active_pid != 0)
+			{
+				ereport(INFO,
+						(errmsg("terminating active connection by pid %d to local slot \"%s\" because of conflict with recovery",
+							active_pid, NameStr(slot->data.name))));
+
+				if (kill(active_pid, SIGTERM))
+					elog(DEBUG1, "failed to signal pid %d to terminate on slot conflict: %m",
+							active_pid);
+
+				/*
+				 * No way to wait for the process since it's not a child
+				 * of ours and there's no latch to set, so poll.
+				 *
+				 * We're checking this without any locks held, but
+				 * we'll recheck when we attempt to drop the slot.
+				 */
+				while (slot->in_use && slot->active_pid == active_pid
+						&& max_sleep_millis > 0)
+				{
+					int rc;
+
+					rc = WaitLatch(MyLatch,
+							WL_TIMEOUT | WL_LATCH_SET | WL_POSTMASTER_DEATH,
+							millis_per_sleep);
+
+					if (rc & WL_POSTMASTER_DEATH)
+						elog(FATAL, "exiting after postmaster termination");
+
+					/*
+					 * Might be shorter if something sets our latch, but
+					 * we don't care much.
+					 */
+					max_sleep_millis -= millis_per_sleep;
+				}
+
+				if (max_sleep_millis <= 0)
+					elog(WARNING, "process %d is taking too long to terminate after SIGTERM",
+							active_pid);
+			}
+
+			if (active_pid == 0)
+			{
+				/* Try to acquire and drop the slot */
+				SpinLockAcquire(&slot->mutex);
+
+				if (slot->active_pid != 0)
+				{
+					/* Lost the race, go around */
+				}
+				else
+				{
+					/* Claim the slot for ourselves */
+					slot->active_pid = MyProcPid;
+					MyReplicationSlot = slot;
+				}
+				SpinLockRelease(&slot->mutex);
+			}
+
+			if (slot->active_pid == MyProcPid)
+			{
+				NameData slotname;
+				strncpy(NameStr(slotname), NameStr(slot->data.name), NAMEDATALEN);
+				(NameStr(slotname))[NAMEDATALEN-1] = '\0';
+
+				/*
+				 * Reclaim the allocation lock and THEN drop the slot,
+				 * so nobody else can grab the name until we've
+				 * finished redo.
+				 */
+				LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+				ReplicationSlotDropAcquired();
+				/* We clobbered the duplicate, treat it as new */
+				found_duplicate = false;
+
+				ereport(WARNING,
+						(errmsg("dropped local replication slot \"%s\" because of conflict with recovery",
+								NameStr(slotname)),
+						 errdetail("A failover slot with the same name was created on the master server")));
+			}
+		}
+		while (slot->in_use);
+	}
+
+	Assert(LWLockHeldByMe(ReplicationSlotAllocationLock));
+
+	/*
+	 * This is either an empty slot control position to make a new slot or it's
+	 * an existing entry for this failover slot that we need to update.
+	 */
+	if (found_available)
+	{
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+
+		slot = &ReplicationSlotCtl->replication_slots[use_slotid];
+
+		/* restore the entire set of persistent data */
+		memcpy(&slot->data, xlrec,
+			   sizeof(ReplicationSlotPersistentData));
+
+		Assert(strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) == 0);
+		Assert(slot->data.failover && slot->data.persistency == RS_PERSISTENT);
+
+		/* Update the non-persistent in-memory state */
+		slot->effective_xmin = xlrec->xmin;
+		slot->effective_catalog_xmin = xlrec->catalog_xmin;
+
+		if (found_duplicate)
+		{
+			char		path[MAXPGPATH];
+
+			/* Write an existing slot to disk */
+			Assert(slot->in_use);
+			Assert(slot->active_pid == 0); /* can't be replaying from failover slot */
+
+			sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
+			slot->dirty = true;
+			SaveSlotToPath(slot, path, ERROR);
+		}
+		else
+		{
+			Assert(!slot->in_use);
+			/* In-memory state that's only set on create, not update */
+			slot->active_pid = 0;
+			slot->in_use = true;
+			slot->candidate_catalog_xmin = InvalidTransactionId;
+			slot->candidate_xmin_lsn = InvalidXLogRecPtr;
+			slot->candidate_restart_lsn = InvalidXLogRecPtr;
+			slot->candidate_restart_valid = InvalidXLogRecPtr;
+
+			CreateSlotOnDisk(slot);
+		}
+
+		LWLockRelease(ReplicationSlotControlLock);
+
+		ReplicationSlotsUpdateRequiredXmin(false);
+		ReplicationSlotsUpdateRequiredLSN();
+	}
+
+	LWLockRelease(ReplicationSlotAllocationLock);
+
+	if (!found_available)
+	{
+		/*
+		 * Because the standby should have the same or greater max_replication_slots
+		 * as the master this shouldn't happen, but just in case...
+		 */
+		ereport(ERROR,
+				(errmsg("max_replication_slots exceeded, cannot replay failover slot creation"),
+				 errhint("Increase max_replication_slots.")));
+	}
+}
+
+/*
+ * Redo a slot drop of a failover slot. This might be a redo during crash
+ * recovery on the master or it may be replay on a standby.
+ */
+static void
+ReplicationSlotRedoDrop(const char * slotname)
+{
+	/*
+	 * Acquire the failover slot that's to be dropped.
+	 *
+	 * We can't ReplicationSlotAcquire here because we want to acquire
+	 * a replication slot during replay, which isn't usually allowed.
+	 * Also, because we might crash midway through a drop we can't
+	 * assume we'll actually find the slot so it's not an error for
+	 * the slot to be missing.
+	 */
+	int			i;
+
+	Assert(MyReplicationSlot == NULL);
+
+	ReplicationSlotValidateName(slotname, ERROR);
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		if (s->in_use && strcmp(slotname, NameStr(s->data.name)) == 0)
+		{
+			if (s->data.persistency != RS_PERSISTENT)
+			{
+				/* shouldn't happen */
+				elog(WARNING, "found conflicting non-persistent slot during failover slot drop");
+				break;
+			}
+
+			if (!s->data.failover)
+			{
+				/* shouldn't happen */
+				elog(WARNING, "found non-failover slot during redo of slot drop");
+				break;
+			}
+
+			/* A failover slot can't be active during recovery */
+			Assert(s->active_pid == 0);
+
+			/* Claim the slot */
+			s->active_pid = MyProcPid;
+			MyReplicationSlot = s;
+
+			break;
+		}
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	if (MyReplicationSlot != NULL)
+	{
+		ReplicationSlotDropAcquired();
+	}
+	else
+	{
+		elog(WARNING, "failover slot \"%s\" not found during redo of drop",
+				slotname);
+	}
+}
+
+void
+replslot_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		/*
+		 * Update the values for an existing failover slot or, when a slot
+		 * is first logged as persistent, create it on the downstream.
+		 */
+		case XLOG_REPLSLOT_UPDATE:
+			ReplicationSlotRedoCreateOrUpdate((ReplicationSlotInWAL) XLogRecGetData(record));
+			break;
+
+		/*
+		 * Drop an existing failover slot.
+		 */
+		case XLOG_REPLSLOT_DROP:
+			{
+				xl_replslot_drop *xlrec =
+				(xl_replslot_drop *) XLogRecGetData(record);
+
+				ReplicationSlotRedoDrop(NameStr(xlrec->name));
+
+				break;
+			}
+
+		default:
+			elog(PANIC, "replslot_redo: unknown op code %u", info);
+	}
 }
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 9cc24ea..f430714 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -57,7 +57,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 	CheckSlotRequirements();
 
 	/* acquire replication slot, this will check for conflicting names */
-	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT);
+	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT, false);
 
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	nulls[0] = false;
@@ -120,7 +120,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	 * errors during initialization because it'll get dropped if this
 	 * transaction fails. We'll make it persistent at the end.
 	 */
-	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL);
+	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL, false);
 
 	/*
 	 * Create logical decoding context, to build the initial snapshot.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c03e045..1583862 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -792,7 +792,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	if (cmd->kind == REPLICATION_KIND_PHYSICAL)
 	{
-		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT);
+		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT, false);
 	}
 	else
 	{
@@ -803,7 +803,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * handle errors during initialization because it'll get dropped if
 		 * this transaction fails. We'll make it persistent at the end.
 		 */
-		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL);
+		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL, false);
 	}
 
 	initStringInfo(&output_message);
@@ -1523,7 +1523,7 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	if (changed)
 	{
 		ReplicationSlotMarkDirty();
-		ReplicationSlotsComputeRequiredLSN();
+		ReplicationSlotsUpdateRequiredLSN();
 	}
 
 	/*
@@ -1619,7 +1619,7 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 	if (changed)
 	{
 		ReplicationSlotMarkDirty();
-		ReplicationSlotsComputeRequiredXmin(false);
+		ReplicationSlotsUpdateRequiredXmin(false);
 	}
 }
 
diff --git a/src/bin/pg_xlogdump/replslotdesc.c b/src/bin/pg_xlogdump/replslotdesc.c
new file mode 120000
index 0000000..2e088d2
--- /dev/null
+++ b/src/bin/pg_xlogdump/replslotdesc.c
@@ -0,0 +1 @@
+../../../src/backend/access/rmgrdesc/replslotdesc.c
\ No newline at end of file
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index f9cd395..73ed7d4 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -26,6 +26,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "rmgrdesc.h"
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index fab912d..124b7e5 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -45,3 +45,4 @@ PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_start
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
+PG_RMGR(RM_REPLSLOT_ID, "ReplicationSlot", replslot_redo, replslot_desc, replslot_identify, NULL, NULL)
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8be8ab6..cdcbd37 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -11,69 +11,12 @@
 
 #include "fmgr.h"
 #include "access/xlog.h"
-#include "access/xlogreader.h"
+#include "replication/slot_xlog.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
 
 /*
- * Behaviour of replication slots, upon release or crash.
- *
- * Slots marked as PERSISTENT are crashsafe and will not be dropped when
- * released. Slots marked as EPHEMERAL will be dropped when released or after
- * restarts.
- *
- * EPHEMERAL slots can be made PERSISTENT by calling ReplicationSlotPersist().
- */
-typedef enum ReplicationSlotPersistency
-{
-	RS_PERSISTENT,
-	RS_EPHEMERAL
-} ReplicationSlotPersistency;
-
-/*
- * On-Disk data of a replication slot, preserved across restarts.
- */
-typedef struct ReplicationSlotPersistentData
-{
-	/* The slot's identifier */
-	NameData	name;
-
-	/* database the slot is active on */
-	Oid			database;
-
-	/*
-	 * The slot's behaviour when being dropped (or restored after a crash).
-	 */
-	ReplicationSlotPersistency persistency;
-
-	/*
-	 * xmin horizon for data
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId xmin;
-
-	/*
-	 * xmin horizon for catalog tuples
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId catalog_xmin;
-
-	/* oldest LSN that might be required by this replication slot */
-	XLogRecPtr	restart_lsn;
-
-	/* oldest LSN that the client has acked receipt for */
-	XLogRecPtr	confirmed_flush;
-
-	/* plugin name */
-	NameData	plugin;
-} ReplicationSlotPersistentData;
-
-/*
  * Shared memory state of a single replication slot.
  */
 typedef struct ReplicationSlot
@@ -155,7 +98,7 @@ extern void ReplicationSlotsShmemInit(void);
 
 /* management of individual slots */
 extern void ReplicationSlotCreate(const char *name, bool db_specific,
-					  ReplicationSlotPersistency p);
+					  ReplicationSlotPersistency p, bool failover);
 extern void ReplicationSlotPersist(void);
 extern void ReplicationSlotDrop(const char *name);
 
@@ -167,12 +110,14 @@ extern void ReplicationSlotMarkDirty(void);
 /* misc stuff */
 extern bool ReplicationSlotValidateName(const char *name, int elevel);
 extern void ReplicationSlotReserveWal(void);
-extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
-extern void ReplicationSlotsComputeRequiredLSN(void);
+extern void ReplicationSlotsUpdateRequiredXmin(bool already_locked);
+extern void ReplicationSlotsUpdateRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
+extern XLogRecPtr ReplicationSlotsComputeRequiredLSN(bool failover_only);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
-extern void StartupReplicationSlots(void);
+extern void StartupReplicationSlots(bool drop_nonfailover_slots);
 extern void CheckPointReplicationSlots(void);
 
 extern void CheckSlotRequirements(void);
diff --git a/src/include/replication/slot_xlog.h b/src/include/replication/slot_xlog.h
new file mode 100644
index 0000000..f8e09a8
--- /dev/null
+++ b/src/include/replication/slot_xlog.h
@@ -0,0 +1,100 @@
+/*-------------------------------------------------------------------------
+ * slot_xlog.h
+ *	   Replication slot management.
+ *
+ * Copyright (c) 2012-2015, PostgreSQL Global Development Group
+ *
+ * src/include/replication/slot_xlog.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef SLOT_XLOG_H
+#define SLOT_XLOG_H
+
+#include "fmgr.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogreader.h"
+
+/*
+ * Behaviour of replication slots, upon release or crash.
+ *
+ * Slots marked as PERSISTENT are crashsafe and will not be dropped when
+ * released. Slots marked as EPHEMERAL will be dropped when released or after
+ * restarts.
+ *
+ * EPHEMERAL slots can be made PERSISTENT by calling ReplicationSlotPersist().
+ */
+typedef enum ReplicationSlotPersistency
+{
+	RS_PERSISTENT,
+	RS_EPHEMERAL
+} ReplicationSlotPersistency;
+
+/*
+ * On-Disk data of a replication slot, preserved across restarts.
+ */
+typedef struct ReplicationSlotPersistentData
+{
+	/* The slot's identifier */
+	NameData	name;
+
+	/* database the slot is active on */
+	Oid			database;
+
+	/*
+	 * The slot's behaviour when being dropped (or restored after a crash).
+	 */
+	ReplicationSlotPersistency persistency;
+
+	/*
+	 * Slots created on master become failover-slots and are maintained
+	 * on all standbys, but are only assignable after failover.
+	 */
+	bool		failover;
+
+	/*
+	 * xmin horizon for data
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId xmin;
+
+	/*
+	 * xmin horizon for catalog tuples
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId catalog_xmin;
+
+	/* oldest LSN that might be required by this replication slot */
+	XLogRecPtr	restart_lsn;
+	TimeLineID	restart_tli;
+
+	/* oldest LSN that the client has acked receipt for */
+	XLogRecPtr	confirmed_flush;
+
+	/* plugin name */
+	NameData	plugin;
+} ReplicationSlotPersistentData;
+
+typedef ReplicationSlotPersistentData *ReplicationSlotInWAL;
+
+/*
+ * WAL records for failover slots
+ */
+#define XLOG_REPLSLOT_UPDATE	0x00
+#define XLOG_REPLSLOT_DROP		0x01
+
+typedef struct xl_replslot_drop
+{
+	NameData	name;
+} xl_replslot_drop;
+
+/* WAL logging */
+extern void replslot_redo(XLogReaderState *record);
+extern void replslot_desc(StringInfo buf, XLogReaderState *record);
+extern const char *replslot_identify(uint8 info);
+
+#endif   /* SLOT_XLOG_H */
-- 
2.1.0

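As a quick illustration of the user interface added by the 0003 patch below, the proposed API could be exercised roughly like this (a sketch against a server with this patch set applied; the `failover` argument, the `FAILOVER` keyword, and the `pg_replication_slots.failover` column are all additions proposed here and may change before commit):

```sql
-- Create a logical slot as a failover slot via the proposed third argument.
SELECT * FROM pg_create_logical_replication_slot('my_failover_slot', 'test_decoding', true);

-- Physical slots gain the same option as a new third argument.
SELECT * FROM pg_create_physical_replication_slot('phys_slot', true, true);

-- The proposed new column shows which slots would survive promotion of a standby.
SELECT slot_name, slot_type, failover FROM pg_replication_slots;

-- Equivalent walsender-protocol command using the proposed FAILOVER keyword:
-- CREATE_REPLICATION_SLOT my_failover_slot LOGICAL test_decoding FAILOVER
```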
0003-Add-the-UI-and-documentation-for-failover-slots.patchtext/x-patch; charset=US-ASCII; name=0003-Add-the-UI-and-documentation-for-failover-slots.patchDownload
From b615b1ca786be023d27886c0d0bce4c7c50444d0 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 15 Feb 2016 12:00:59 +0800
Subject: [PATCH 3/4] Add the UI and documentation for failover slots

Expose failover slots to the user.

Add a new 'failover' argument to pg_create_logical_replication_slot and
pg_create_physical_replication_slot . Report if a slot is a failover
slot in pg_catalog.pg_replication_slots. Accept a new FAILOVER keyword
argument in CREATE_REPLICATION_SLOT on the walsender protocol.

Document the existence of failover slots support and how to use them.
---
 contrib/test_decoding/expected/ddl.out | 41 ++++++++++++++++++---
 contrib/test_decoding/sql/ddl.sql      | 17 ++++++++-
 doc/src/sgml/catalogs.sgml             | 10 +++++
 doc/src/sgml/func.sgml                 | 24 ++++++++----
 doc/src/sgml/high-availability.sgml    | 67 ++++++++++++++++++++++++++++++++--
 doc/src/sgml/logicaldecoding.sgml      | 52 +++++++++++++++++---------
 doc/src/sgml/protocol.sgml             | 22 ++++++++++-
 src/backend/catalog/system_views.sql   | 12 +++++-
 src/backend/replication/repl_gram.y    | 13 ++++++-
 src/backend/replication/slotfuncs.c    | 13 +++++--
 src/backend/replication/walsender.c    |  4 +-
 src/include/catalog/pg_proc.h          |  6 +--
 src/include/nodes/replnodes.h          |  1 +
 src/include/replication/slot.h         |  1 +
 src/test/regress/expected/rules.out    |  3 +-
 15 files changed, 236 insertions(+), 50 deletions(-)

diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out
index 57a1289..5b2f34a 100644
--- a/contrib/test_decoding/expected/ddl.out
+++ b/contrib/test_decoding/expected/ddl.out
@@ -9,6 +9,9 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
 -- fail because of an already existing slot
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 ERROR:  replication slot "regression_slot" already exists
+-- fail because a failover slot can't replace a normal slot on the master
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding', true);
+ERROR:  replication slot "regression_slot" already exists
 -- fail because of an invalid name
 SELECT 'init' FROM pg_create_logical_replication_slot('Invalid Name', 'test_decoding');
 ERROR:  replication slot name "Invalid Name" contains invalid character
@@ -58,11 +61,37 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
 SELECT slot_name, plugin, slot_type, active,
     NOT catalog_xmin IS NULL AS catalog_xmin_set,
     xmin IS NULl  AS data_xmin_not_set,
-    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
 FROM pg_replication_slots;
-    slot_name    |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal 
------------------+---------------+-----------+--------+------------------+-------------------+----------
- regression_slot | test_decoding | logical   | f      | t                | t                 | t
+    slot_name    |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal | failover 
+-----------------+---------------+-----------+--------+------------------+-------------------+----------+----------
+ regression_slot | test_decoding | logical   | f      | t                | t                 | t        | f
+(1 row)
+
+/* same for a failover slot */
+SELECT 'init' FROM pg_create_logical_replication_slot('failover_slot', 'test_decoding', true);
+ ?column? 
+----------
+ init
+(1 row)
+
+SELECT slot_name, plugin, slot_type, active,
+    NOT catalog_xmin IS NULL AS catalog_xmin_set,
+    xmin IS NULl  AS data_xmin_not_set,
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
+FROM pg_replication_slots
+WHERE slot_name = 'failover_slot';
+   slot_name   |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal | failover 
+---------------+---------------+-----------+--------+------------------+-------------------+----------+----------
+ failover_slot | test_decoding | logical   | f      | t                | t                 | t        | t
+(1 row)
+
+SELECT pg_drop_replication_slot('failover_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
 (1 row)
 
 /*
@@ -673,7 +702,7 @@ SELECT pg_drop_replication_slot('regression_slot');
 
 /* check that the slot is gone */
 SELECT * FROM pg_replication_slots;
- slot_name | plugin | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
+ slot_name | plugin | slot_type | datoid | database | active | active_pid | failover | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
+-----------+--------+-----------+--------+----------+--------+------------+----------+------+--------------+-------------+---------------------
 (0 rows)
 
diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql
index e311c59..f64b21c 100644
--- a/contrib/test_decoding/sql/ddl.sql
+++ b/contrib/test_decoding/sql/ddl.sql
@@ -4,6 +4,8 @@ SET synchronous_commit = on;
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 -- fail because of an already existing slot
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+-- fail because a failover slot can't replace a normal slot on the master
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding', true);
 -- fail because of an invalid name
 SELECT 'init' FROM pg_create_logical_replication_slot('Invalid Name', 'test_decoding');
 
@@ -22,16 +24,27 @@ SELECT 'init' FROM pg_create_physical_replication_slot('repl');
 SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 SELECT pg_drop_replication_slot('repl');
 
-
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 
 /* check whether status function reports us, only reproduceable columns */
 SELECT slot_name, plugin, slot_type, active,
     NOT catalog_xmin IS NULL AS catalog_xmin_set,
     xmin IS NULl  AS data_xmin_not_set,
-    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
 FROM pg_replication_slots;
 
+/* same for a failover slot */
+SELECT 'init' FROM pg_create_logical_replication_slot('failover_slot', 'test_decoding', true);
+SELECT slot_name, plugin, slot_type, active,
+    NOT catalog_xmin IS NULL AS catalog_xmin_set,
+    xmin IS NULl  AS data_xmin_not_set,
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
+FROM pg_replication_slots
+WHERE slot_name = 'failover_slot';
+SELECT pg_drop_replication_slot('failover_slot');
+
 /*
  * Check that changes are handled correctly when interleaved with ddl
  */
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 412c845..053b91a 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -5377,6 +5377,16 @@
      </row>
 
      <row>
+      <entry><structfield>failover</structfield></entry>
+      <entry><type>boolean</type></entry>
+      <entry></entry>
+      <entry>
+       True if this slot is a failover slot; see
+       <xref linkend="streaming-replication-slots-failover">.
+      </entry>
+     </row>
+
+     <row>
       <entry><structfield>xmin</structfield></entry>
       <entry><type>xid</type></entry>
       <entry></entry>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f9eea76..ef49bd7 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17420,7 +17420,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         <indexterm>
          <primary>pg_create_physical_replication_slot</primary>
         </indexterm>
-        <literal><function>pg_create_physical_replication_slot(<parameter>slot_name</parameter> <type>name</type> <optional>, <parameter>immediately_reserve</> <type>boolean</> </optional>)</function></literal>
+        <literal><function>pg_create_physical_replication_slot(<parameter>slot_name</parameter> <type>name</type>, <optional><parameter>immediately_reserve</> <type>boolean</></optional>, <optional><parameter>failover</> <type>boolean</></optional>)</function></literal>
        </entry>
        <entry>
         (<parameter>slot_name</parameter> <type>name</type>, <parameter>xlog_position</parameter> <type>pg_lsn</type>)
@@ -17431,7 +17431,10 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         when <literal>true</>, specifies that the <acronym>LSN</> for this
         replication slot be reserved immediately; otherwise
         the <acronym>LSN</> is reserved on first connection from a streaming
-        replication client. Streaming changes from a physical slot is only
+        replication client. If <literal>failover</literal> is <literal>true</literal>
+        then the slot is created as a failover slot; see <xref
+        linkend="streaming-replication-slots-failover">.
+        Streaming changes from a physical slot is only
         possible with the streaming-replication protocol &mdash;
         see <xref linkend="protocol-replication">. This function corresponds
         to the replication protocol command <literal>CREATE_REPLICATION_SLOT
@@ -17460,7 +17463,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         <indexterm>
          <primary>pg_create_logical_replication_slot</primary>
         </indexterm>
-        <literal><function>pg_create_logical_replication_slot(<parameter>slot_name</parameter> <type>name</type>, <parameter>plugin</parameter> <type>name</type>)</function></literal>
+        <literal><function>pg_create_logical_replication_slot(<parameter>slot_name</parameter> <type>name</type>, <parameter>plugin</parameter> <type>name</type>, <optional><parameter>failover</> <type>boolean</></optional>)</function></literal>
        </entry>
        <entry>
         (<parameter>slot_name</parameter> <type>name</type>, <parameter>xlog_position</parameter> <type>pg_lsn</type>)
@@ -17468,8 +17471,10 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
        <entry>
         Creates a new logical (decoding) replication slot named
         <parameter>slot_name</parameter> using the output plugin
-        <parameter>plugin</parameter>.  A call to this function has the same
-        effect as the replication protocol command
+        <parameter>plugin</parameter>. If <literal>failover</literal>
+        is <literal>true</literal> the slot is created as a failover
+        slot; see <xref linkend="streaming-replication-slots-failover">. A call to
+        this function has the same effect as the replication protocol command
         <literal>CREATE_REPLICATION_SLOT ... LOGICAL</literal>.
        </entry>
       </row>
@@ -17485,7 +17490,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         (<parameter>location</parameter> <type>pg_lsn</type>, <parameter>xid</parameter> <type>xid</type>, <parameter>data</parameter> <type>text</type>)
        </entry>
        <entry>
-        Returns changes in the slot <parameter>slot_name</parameter>, starting
+        Returns changes in the slot <parameter>slot_name</parameter>, starting
         from the point at which since changes have been consumed last.  If
         <parameter>upto_lsn</> and <parameter>upto_nchanges</> are NULL,
         logical decoding will continue until end of WAL.  If
@@ -17495,7 +17500,12 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         stop when the number of rows produced by decoding exceeds
         the specified value.  Note, however, that the actual number of
         rows returned may be larger, since this limit is only checked after
-        adding the rows produced when decoding each new transaction commit.
+        adding the rows produced when decoding each new transaction commit,
+        so at least one transaction is always returned. The returned changes
+        are consumed and will not be returned by subsequent calls to
+        <function>pg_logical_slot_get_changes</function>, though a server
+        crash may cause recently consumed changes to be replayed again after
+        recovery.
        </entry>
       </row>
 
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 6cb690c..1624d51 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -859,7 +859,8 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
      <xref linkend="functions-recovery-info-table"> for details).
      The last WAL receive location in the standby is also displayed in the
      process status of the WAL receiver process, displayed using the
-     <command>ps</> command (see <xref linkend="monitoring-ps"> for details).
+     <command>ps</> command (see <xref linkend="monitoring-ps"> for details)
+     and in the <literal>pg_stat_replication</literal> view.
     </para>
     <para>
      You can retrieve a list of WAL sender processes via the
@@ -871,10 +872,15 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
      <function>pg_last_xlog_receive_location</> on the standby might indicate
      network delay, or that the standby is under heavy load.
     </para>
+    <para>
+     You can compare xlog locations and measure replication lag using
+     <link linkend="functions-admin-backup"><function>pg_xlog_location_diff(...)</function></link>
+     in a query over <literal>pg_stat_replication</literal>.
+    </para>
    </sect3>
   </sect2>
 
-  <sect2 id="streaming-replication-slots">
+  <sect2 id="streaming-replication-slots" xreflabel="Replication slots">
    <title>Replication Slots</title>
    <indexterm>
     <primary>replication slot</primary>
@@ -885,7 +891,10 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
     not remove WAL segments until they have been received by all standbys,
     and that the master does not remove rows which could cause a
     <link linkend="hot-standby-conflict">recovery conflict</> even when the
-    standby is disconnected.
+    standby is disconnected. They allow clients to receive a stream of
+    changes in an ordered, consistent manner: either raw WAL from a physical
+    replication slot or a logical stream of row changes from a
+    <link linkend="logicaldecoding-slots">logical replication slot</link>.
    </para>
    <para>
     In lieu of using replication slots, it is possible to prevent the removal
@@ -906,6 +915,17 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
     and the latter often needs to be set to a high value to provide adequate
     protection.  Replication slots overcome these disadvantages.
    </para>
+   <para>
+    Because replication slots cause the server to retain transaction logs
+    in <filename>pg_xlog</filename> it is important to monitor how far slots
+    are lagging behind the master in order to prevent the disk from filling
+    up and interrupting the master's operation. A query like:
+    <programlisting>
+      SELECT *, pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn) AS lag_bytes
+      FROM pg_replication_slots;
+    </programlisting>
+    will provide an indication of how far a slot is lagging.
+   </para>
    <sect3 id="streaming-replication-slots-manipulation">
     <title>Querying and manipulating replication slots</title>
     <para>
@@ -949,6 +969,47 @@ primary_slot_name = 'node_a_slot'
 </programlisting>
     </para>
    </sect3>
+
+   <sect3 id="streaming-replication-slots-failover" xreflabel="Failover slots">
+     <title>Failover slots</title>
+
+     <para>
+      Normally a replication slot is not preserved across backup and restore
+      (such as by <application>pg_basebackup</application>) and is not
+      replicated to standbys. Slots are <emphasis>automatically
+      dropped</emphasis> when starting up as a streaming replica or in archive
+      recovery (PITR) mode.
+     </para>
+
+     <para>
+      To make it possible for an application to consistently follow
+      failover when a replica is promoted to a new master, a slot may be
+      created as a <emphasis>failover slot</emphasis>. A failover slot may
+      only be created, replayed from or dropped on a master server. Changes to
+      the slot are written to WAL and replicated to standbys. When a standby
+      is promoted applications may connect to the slot on the standby and
+      resume replay from it at a consistent point, as if it was the original
+      master. Failover slots may not be used to replay from a standby before
+      promotion.
+     </para>
+
+     <para>
+      Non-failover slots may be created on and used from a replica. This is
+      currently limited to physical slots as logical decoding is not supported
+      on replica server.
+     </para>
+
+     <para>
+      When a failover slot created on the master has the same name as a
+      non-failover slot on a replica server the non-failover slot will be
+      automatically dropped. Any client currently connected will be
+      disconnected with an error indicating a conflict with recovery. It
+      is strongly recommended that you avoid creating failover slots with
+      the same name as slots on replicas.
+     </para>
+
+   </sect3>
+
   </sect2>
 
   <sect2 id="cascading-replication">
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index e841348..7f6a73d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -12,15 +12,17 @@
 
   <para>
    Changes are sent out in streams identified by logical replication slots.
-   Each stream outputs each change exactly once.
+   Each stream outputs each change once, though repeats are possible after a
+   server crash.
   </para>
 
   <para>
    The format in which those changes are streamed is determined by the output
-   plugin used.  An example plugin is provided in the PostgreSQL distribution.
-   Additional plugins can be
-   written to extend the choice of available formats without modifying any
-   core code.
+   plugin used.  An example plugin (test_decoding) is provided in the
+   PostgreSQL distribution.  Additional plugins can be written to extend the
+   choice of available formats without modifying any core code.
+  </para>
+  <para>
    Every output plugin has access to each individual new row produced
    by <command>INSERT</command> and the new row version created
    by <command>UPDATE</command>.  Availability of old row versions for
@@ -192,7 +194,7 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
     </para>
    </sect2>
 
-   <sect2>
+   <sect2 id="logicaldecoding-slots" xreflabel="Logical Replication Slots">
     <title>Replication Slots</title>
 
     <indexterm>
@@ -201,20 +203,18 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
     </indexterm>
 
     <para>
+     The general concepts of replication slots are discussed under
+     <xref linkend="streaming-replication-slots">. This topic only covers
+     specifics for logical slots.
+    </para>
+
+    <para>
      In the context of logical replication, a slot represents a stream of
      changes that can be replayed to a client in the order they were made on
      the origin server. Each slot streams a sequence of changes from a single
-     database, sending each change exactly once (except when peeking forward
-     in the stream).
+     database, sending each change only once.
     </para>
 
-    <note>
-     <para><productname>PostgreSQL</productname> also has streaming replication slots
-     (see <xref linkend="streaming-replication">), but they are used somewhat
-     differently there.
-     </para>
-    </note>
-
     <para>
      A replication slot has an identifier that is unique across all databases
      in a <productname>PostgreSQL</productname> cluster. Slots persist
@@ -243,9 +243,22 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
       even when there is no connection using them. This consumes storage
       because neither required WAL nor required rows from the system catalogs
       can be removed by <command>VACUUM</command> as long as they are required by a replication
-      slot.  So if a slot is no longer required it should be dropped.
+      slot.  If a slot is no longer required it should be dropped to prevent
+      <filename>pg_xlog</filename> from filling up and (for logical slots)
+      the system catalogs from bloating.
      </para>
     </note>
+
+    <para>
+     A replication slot keeps track of the oldest needed WAL position that the
+     application may need to restart replay from. It does <emphasis>not</emphasis>
+     guarantee never to replay the same data twice and updates to the restart
+     position are not immediately flushed to disk so they may be lost if the
+     server crashes. The client application is responsible for keeping track of
+     the point it has replayed up to and should request that replay restart at
+     that point when it reconnects by passing the last-replayed LSN to the
+     start replication command.
+    </para>
    </sect2>
 
    <sect2>
@@ -268,7 +281,10 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
      SNAPSHOT</literal></link> to read the state of the database at the moment
      the slot was created. This transaction can then be used to dump the
      database's state at that point in time, which afterwards can be updated
-     using the slot's contents without losing any changes.
+     using the slot's contents without losing any changes. The exported snapshot
+     remains valid until the connection that created it runs another command
+     or disconnects. It may be imported into another connection and re-exported
+     to preserve it longer.
     </para>
    </sect2>
   </sect1>
@@ -280,7 +296,7 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
     The commands
     <itemizedlist>
      <listitem>
-      <para><literal>CREATE_REPLICATION_SLOT <replaceable>slot_name</replaceable> LOGICAL <replaceable>output_plugin</replaceable></literal></para>
+      <para><literal>CREATE_REPLICATION_SLOT <replaceable>slot_name</replaceable> LOGICAL <replaceable>output_plugin</replaceable> <optional>FAILOVER</optional></literal></para>
      </listitem>
 
      <listitem>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 522128e..cbf523d 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1434,13 +1434,14 @@ The commands accepted in walsender mode are:
   </varlistentry>
 
   <varlistentry>
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> { <literal>PHYSICAL</> [ <literal>RESERVE_WAL</> ] | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> { <literal>PHYSICAL</> <optional><literal>RESERVE_WAL</></> | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> } <optional><literal>FAILOVER</></>
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
      <para>
       Create a physical or logical replication
-      slot. See <xref linkend="streaming-replication-slots"> for more about
+      slot. See <xref linkend="streaming-replication-slots"> and
+      <xref linkend="logicaldecoding-slots"> for more about
       replication slots.
      </para>
      <variablelist>
@@ -1474,6 +1475,17 @@ The commands accepted in walsender mode are:
         </para>
        </listitem>
       </varlistentry>
+
+      <varlistentry>
+       <term><literal>FAILOVER</></term>
+       <listitem>
+        <para>
+         Create this slot as a <link linkend="streaming-replication-slots-failover">
+         failover slot</link>.
+        </para>
+       </listitem>
+      </varlistentry>
+
      </variablelist>
     </listitem>
   </varlistentry>
@@ -1829,6 +1841,12 @@ The commands accepted in walsender mode are:
       to process the output for streaming.
      </para>
 
+     <para>
+      Logical replication automatically follows timeline switches. It is
+      not necessary or possible to supply a <literal>TIMELINE</literal>
+      option as in physical replication.
+     </para>
+
      <variablelist>
       <varlistentry>
        <term><literal>SLOT</literal> <replaceable class="parameter">slot_name</></term>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 923fe58..b4f8fbe 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -698,6 +698,7 @@ CREATE VIEW pg_replication_slots AS
             D.datname AS database,
             L.active,
             L.active_pid,
+            L.failover,
             L.xmin,
             L.catalog_xmin,
             L.restart_lsn,
@@ -943,12 +944,21 @@ AS 'pg_logical_slot_peek_binary_changes';
 
 CREATE OR REPLACE FUNCTION pg_create_physical_replication_slot(
     IN slot_name name, IN immediately_reserve boolean DEFAULT false,
-    OUT slot_name name, OUT xlog_position pg_lsn)
+    IN failover boolean DEFAULT false, OUT slot_name name,
+    OUT xlog_position pg_lsn)
 RETURNS RECORD
 LANGUAGE INTERNAL
 STRICT VOLATILE
 AS 'pg_create_physical_replication_slot';
 
+CREATE OR REPLACE FUNCTION pg_create_logical_replication_slot(
+    IN slot_name name, IN plugin name, IN failover boolean DEFAULT false,
+    OUT slot_name text, OUT xlog_position pg_lsn)
+RETURNS RECORD
+LANGUAGE INTERNAL
+STRICT VOLATILE
+AS 'pg_create_logical_replication_slot';
+
 CREATE OR REPLACE FUNCTION
   make_interval(years int4 DEFAULT 0, months int4 DEFAULT 0, weeks int4 DEFAULT 0,
                 days int4 DEFAULT 0, hours int4 DEFAULT 0, mins int4 DEFAULT 0,
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index d93db88..1574f24 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -77,6 +77,7 @@ Node *replication_parse_result;
 %token K_LOGICAL
 %token K_SLOT
 %token K_RESERVE_WAL
+%token K_FAILOVER
 
 %type <node>	command
 %type <node>	base_backup start_replication start_logical_replication
@@ -90,6 +91,7 @@ Node *replication_parse_result;
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot
 %type <boolval>	opt_reserve_wal
+%type <boolval> opt_failover
 
 %%
 
@@ -184,23 +186,25 @@ base_backup_opt:
 
 create_replication_slot:
 			/* CREATE_REPLICATION_SLOT slot PHYSICAL RESERVE_WAL */
-			K_CREATE_REPLICATION_SLOT IDENT K_PHYSICAL opt_reserve_wal
+			K_CREATE_REPLICATION_SLOT IDENT K_PHYSICAL opt_reserve_wal opt_failover
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_PHYSICAL;
 					cmd->slotname = $2;
 					cmd->reserve_wal = $4;
+					cmd->failover = $5;
 					$$ = (Node *) cmd;
 				}
 			/* CREATE_REPLICATION_SLOT slot LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT K_LOGICAL IDENT
+			| K_CREATE_REPLICATION_SLOT IDENT K_LOGICAL IDENT opt_failover
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->plugin = $4;
+					cmd->failover = $5;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -276,6 +280,11 @@ opt_reserve_wal:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_failover:
+			K_FAILOVER						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index f430714..abc450d 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -18,6 +18,7 @@
 
 #include "access/htup_details.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "replication/logical.h"
 #include "replication/logicalfuncs.h"
 #include "utils/builtins.h"
@@ -41,6 +42,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 {
 	Name		name = PG_GETARG_NAME(0);
 	bool 		immediately_reserve = PG_GETARG_BOOL(1);
+	bool		failover = PG_GETARG_BOOL(2);
 	Datum		values[2];
 	bool		nulls[2];
 	TupleDesc	tupdesc;
@@ -57,7 +59,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 	CheckSlotRequirements();
 
 	/* acquire replication slot, this will check for conflicting names */
-	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT, false);
+	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT, failover);
 
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	nulls[0] = false;
@@ -96,6 +98,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 {
 	Name		name = PG_GETARG_NAME(0);
 	Name		plugin = PG_GETARG_NAME(1);
+	bool		failover = PG_GETARG_BOOL(2);
 
 	LogicalDecodingContext *ctx = NULL;
 
@@ -120,7 +123,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	 * errors during initialization because it'll get dropped if this
 	 * transaction fails. We'll make it persistent at the end.
 	 */
-	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL, false);
+	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL, failover);
 
 	/*
 	 * Create logical decoding context, to build the initial snapshot.
@@ -174,7 +177,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS)
 Datum
 pg_get_replication_slots(PG_FUNCTION_ARGS)
 {
-#define PG_GET_REPLICATION_SLOTS_COLS 10
+#define PG_GET_REPLICATION_SLOTS_COLS 11
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -224,6 +227,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		XLogRecPtr	restart_lsn;
 		XLogRecPtr	confirmed_flush_lsn;
 		pid_t		active_pid;
+		bool		failover;
 		Oid			database;
 		NameData	slot_name;
 		NameData	plugin;
@@ -246,6 +250,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 			namecpy(&plugin, &slot->data.plugin);
 
 			active_pid = slot->active_pid;
+			failover = slot->data.failover;
 		}
 		SpinLockRelease(&slot->mutex);
 
@@ -276,6 +281,8 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		else
 			nulls[i++] = true;
 
+		values[i++] = BoolGetDatum(failover);
+
 		if (xmin != InvalidTransactionId)
 			values[i++] = TransactionIdGetDatum(xmin);
 		else
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1583862..efdbfd1 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -792,7 +792,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	if (cmd->kind == REPLICATION_KIND_PHYSICAL)
 	{
-		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT, false);
+		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT, cmd->failover);
 	}
 	else
 	{
@@ -803,7 +803,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * handle errors during initialization because it'll get dropped if
 		 * this transaction fails. We'll make it persistent at the end.
 		 */
-		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL, false);
+		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL, cmd->failover);
 	}
 
 	initStringInfo(&output_message);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index b24e434..89ecada 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5064,13 +5064,13 @@ DATA(insert OID = 3473 (  spg_range_quad_leaf_consistent	PGNSP PGUID 12 1 0 0 0
 DESCR("SP-GiST support for quad tree over range");
 
 /* replication slots */
-DATA(insert OID = 3779 (  pg_create_physical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 2 0 2249 "19 16" "{19,16,19,3220}" "{i,i,o,o}" "{slot_name,immediately_reserve,slot_name,xlog_position}" _null_ _null_ pg_create_physical_replication_slot _null_ _null_ _null_ ));
+DATA(insert OID = 3779 (  pg_create_physical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 16 16" "{19,16,16,19,3220}" "{i,i,i,o,o}" "{slot_name,immediately_reserve,failover,slot_name,xlog_position}" _null_ _null_ pg_create_physical_replication_slot _null_ _null_ _null_ ));
 DESCR("create a physical replication slot");
 DATA(insert OID = 3780 (  pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 0 2278 "19" _null_ _null_ _null_ _null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ ));
 DESCR("drop a replication slot");
-DATA(insert OID = 3781 (  pg_get_replication_slots	PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,23,28,28,3220,3220}" "{o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null_ _null_ pg_get_replication_slots _null_ _null_ _null_ ));
+DATA(insert OID = 3781 (  pg_get_replication_slots	PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,23,16,28,28,3220,3220}" "{o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,active,active_pid,failover,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null_ _null_ pg_get_replication_slots _null_ _null_ _null_ ));
 DESCR("information about replication slots currently in use");
-DATA(insert OID = 3786 (  pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 2 0 2249 "19 19" "{19,19,25,3220}" "{i,i,o,o}" "{slot_name,plugin,slot_name,xlog_position}" _null_ _null_ pg_create_logical_replication_slot _null_ _null_ _null_ ));
+DATA(insert OID = 3786 (  pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 19 16" "{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,failover,slot_name,xlog_position}" _null_ _null_ pg_create_logical_replication_slot _null_ _null_ _null_ ));
 DESCR("set up a logical replication slot");
 DATA(insert OID = 3782 (  pg_logical_slot_get_changes PGNSP PGUID 12 1000 1000 25 0 f f f f f t v u 4 0 2249 "19 3220 23 1009" "{19,3220,23,1009,3220,28,25}" "{i,i,i,v,o,o,o}" "{slot_name,upto_lsn,upto_nchanges,options,location,xid,data}" _null_ _null_ pg_logical_slot_get_changes _null_ _null_ _null_ ));
 DESCR("get changes from replication slot");
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index d2f1edb..a8fa9d5 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		reserve_wal;
+	bool		failover;
 } CreateReplicationSlotCmd;
 
 
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index cdcbd37..9e23a29 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -4,6 +4,7 @@
  *
  * Copyright (c) 2012-2016, PostgreSQL Global Development Group
  *
+ * src/include/replication/slot.h
  *-------------------------------------------------------------------------
  */
 #ifndef SLOT_H
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2bdba2d..f5dd4a8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1414,11 +1414,12 @@ pg_replication_slots| SELECT l.slot_name,
     d.datname AS database,
     l.active,
     l.active_pid,
+    l.failover,
     l.xmin,
     l.catalog_xmin,
     l.restart_lsn,
     l.confirmed_flush_lsn
-   FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, active, active_pid, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn)
+   FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, active, active_pid, failover, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn)
      LEFT JOIN pg_database d ON ((l.datoid = d.oid)));
 pg_roles| SELECT pg_authid.rolname,
     pg_authid.rolsuper,
-- 
2.1.0

#17Oleksii Kliukin
alexk@hintbits.com
In reply to: Craig Ringer (#16)
Re: WIP: Failover Slots

Hi,

On 16 Feb 2016, at 09:11, Craig Ringer <craig@2ndquadrant.com> wrote:

Revision attached. There was a file missing from the patch too.

All attached patches apply cleanly. I only took a look at the first two, but I also tried to run Patroni with the modified version to check whether basic replication works.

Patroni calls pg_basebackup first to initialize the replica, and that actually failed with:

pg_basebackup: unexpected termination of replication stream: ERROR: requested WAL segment 000000010000000000000000 has already been removed

The segment name definitely looks bogus to me.

The actual command causing the failure was an attempt to clone the replica using pg_basebackup, turning on xlog streaming:

pg_basebackup --pgdata data/postgres1 --xlog-method=stream --dbname="host=localhost port=5432 user=replicator"

I checked the same command against the git master without the patches applied and could not reproduce this problem there.

On the code level, I have no comments on 0001; it's well documented and I have no questions about the approach, although I may not be knowledgeable enough to judge the specifics of the implementation.

On 0002, there are a few rough edges:

slot.c:294
elog(LOG, "persistency is %i", (int)slot->data.persistency);

Should be changed to DEBUG?

slot.c:468
Why did you drop “void" as a parameter type of ReplicationSlotDropAcquired?

walsender.c:1509, at PhysicalConfirmReceivedLocation

I’ve noticed a comment stating that we don’t need to call ReplicationSlotSave(), but that pre-dated the WAL-logging of replication slot changes. Don’t we need to call it now, the same way it’s done for the logical slots in logical.c at LogicalConfirmReceivedLocation?

Kind regards,
--
Oleksii

#18Craig Ringer
craig@2ndquadrant.com
In reply to: Oleksii Kliukin (#17)
Re: WIP: Failover Slots

On 22 February 2016 at 23:39, Oleksii Kliukin <alexk@hintbits.com> wrote:

What it’s doing is calling pg_basebackup first to initialize the replica,
and that actually failed with:

pg_basebackup: unexpected termination of replication stream: ERROR:
requested WAL segment 000000010000000000000000 has already been removed

The segment name definitely looks bogus to me.

The actual command causing the failure was an attempt to clone the replica
using pg_basebackup, turning on xlog streaming:

pg_basebackup --pgdata data/postgres1 --xlog-method=stream
--dbname="host=localhost port=5432 user=replicator"

I checked the same command against the git master without the patches
applied and could not reproduce this problem there.

That's a bug. In testing whether we need to return a lower LSN for minimum
WAL for BASE_BACKUP, the code failed to properly test for
InvalidXLogRecPtr. Good catch.

On the code level, I have no comments on 0001; it's well documented and I
have no questions about the approach, although I may not be knowledgeable
enough to judge the specifics of the implementation.

The first patch is the most important IMO, and the one I think needs the
most thought since it's ... well, timelines aren't simple.

slots.c:294
elog(LOG, "persistency is %i", (int)slot->data.persistency);

Should be changed to DEBUG?

That's an escapee log statement I thought I'd already rebased out. Well
spotted, fixed.

slot.c:468
Why did you drop “void" as a parameter type of ReplicationSlotDropAcquired?

That's an editing error on my part that I'll reverse. Since the prototype
declares (void) it doesn't matter, but it's a pointless change. Fixed.

walsender.c: 1509 at PhysicalConfirmReceivedLocation

I’ve noticed a comment stating that we don’t need to call
ReplicationSlotSave(), but that pre-dated the WAL-logging of replication
slot changes. Don’t we need to call it now, the same way it’s done for
the logical slots in logical.c:at LogicalConfirmReceivedLocation?

No, it's safe here. All we must ensure is that a slot is advanced on the
replica when it's advanced on the master. For physical slots even that's a
weak requirement, we just have to stop them from falling *too* far behind
and causing too much xlog retention. For logical slots we should ensure we
advance the slot on the replica before any vacuum activity that might
remove catalog tuples still needed by that slot gets replayed. Basically
the difference is that logical slots keep track of the catalog xmin too, so
they have (slightly) stricter requirements.
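
The distinction is visible on any stock server with both slot types defined; only logical slots populate catalog_xmin (a sketch — actual output varies by configuration, e.g. physical slots only report xmin when hot_standby_feedback is on):

```sql
-- Compare what each slot type pins: physical slots hold back WAL
-- (restart_lsn, plus xmin with hot_standby_feedback), while logical
-- slots additionally hold back catalog vacuuming via catalog_xmin.
SELECT slot_name, slot_type, xmin, catalog_xmin, restart_lsn
FROM pg_replication_slots;
```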

This patch doesn't touch either of those functions except for renaming
ReplicationSlotsComputeRequiredLSN to ReplicationSlotsUpdateRequiredLSN,
which, by the way, I really don't like doing, but I couldn't figure out a
name for the function that computes and returns the required LSN that
wouldn't be even more confusing alongside a
ReplicationSlotsComputeRequiredLSN function as well. Ideas welcome.

Updated patch

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#19Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#18)
7 attachment(s)
Re: WIP: Failover Slots

Updated patch

... attached

I've split it up a bit more too, so it's easier to tell which change is for
what, and I've fixed the issues mentioned by Oleksii. I've also removed some
unrelated documentation changes.

Patch 0001, timeline switches for logical decoding, is unchanged since the
last post.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patchtext/x-patch; charset=US-ASCII; name=0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patchDownload
From 0692b4c88a8ffa3ffd8f5e083745d616ed152a6f Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 23 Feb 2016 16:00:09 +0800
Subject: [PATCH 3/8] Retain extra WAL for failover slots in base backups

Change the return value of pg_start_backup(), the BASE_BACKUP walsender
command, etc to report the minimum WAL required by any failover slot if
this is a lower LSN than the redo position so that base backups contain
the WAL required for slots to work.

Add a new backup label entry 'MIN FAILOVER SLOT LSN' that, if present,
indicates the minimum LSN needed by any failover slot that is present in
the base backup. Backup tools should check for this entry and ensure
they retain all xlogs including and after that point.
---
 src/backend/access/transam/xlog.c | 40 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a92f09d..74b7b23 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9797,6 +9797,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	bool		backup_started_in_recovery = false;
 	XLogRecPtr	checkpointloc;
 	XLogRecPtr	startpoint;
+	XLogRecPtr  slot_startpoint;
 	TimeLineID	starttli;
 	pg_time_t	stamp_time;
 	char		strfbuf[128];
@@ -9943,6 +9944,16 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 			checkpointfpw = ControlFile->checkPointCopy.fullPageWrites;
 			LWLockRelease(ControlFileLock);
 
+			/*
+			 * If failover slots are in use we must retain and transfer WAL
+			 * older than the redo location so that those slots can be replayed
+			 * from after a failover event.
+			 *
+			 * This MUST be at an xlog segment boundary so truncate the LSN
+			 * appropriately.
+			 */
+			slot_startpoint = (ReplicationSlotsComputeRequiredLSN(true) / XLOG_SEG_SIZE) * XLOG_SEG_SIZE;
+
 			if (backup_started_in_recovery)
 			{
 				XLogRecPtr	recptr;
@@ -10111,6 +10122,10 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 						 backup_started_in_recovery ? "standby" : "master");
 		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
 		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
+		if (slot_startpoint != InvalidXLogRecPtr)
+			appendStringInfo(&labelfbuf,  "MIN FAILOVER SLOT LSN: %X/%X\n",
+						(uint32)(slot_startpoint>>32), (uint32)slot_startpoint);
+
 
 		/*
 		 * Okay, write the file, or return its contents to caller.
@@ -10204,9 +10219,34 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 
 	/*
 	 * We're done.  As a convenience, return the starting WAL location.
+	 *
+	 * pg_basebackup etc expect to use this as the position to start copying
+	 * WAL from, so we should return the minimum of the slot start LSN and the
+	 * current redo position to make sure we get all WAL required by failover
+	 * slots.
+	 *
+	 * The min required LSN for failover slots is also available from the
+	 * 'MIN FAILOVER SLOT LSN' entry in the backup label file.
 	 */
+	if (slot_startpoint != InvalidXLogRecPtr && slot_startpoint < startpoint)
+	{
+		List *history;
+		TimeLineID slot_start_tli;
+
+		/* Min LSN required by a slot may be on an older timeline. */
+		history = readTimeLineHistory(ThisTimeLineID);
+		slot_start_tli = tliOfPointInHistory(slot_startpoint, history);
+		list_free_deep(history);
+
+		if (slot_start_tli < starttli)
+			starttli = slot_start_tli;
+
+		startpoint = slot_startpoint;
+	}
+
 	if (starttli_p)
 		*starttli_p = starttli;
+
 	return startpoint;
 }
 
-- 
2.1.0

0004-Add-the-UI-and-for-failover-slots.patchtext/x-patch; charset=US-ASCII; name=0004-Add-the-UI-and-for-failover-slots.patchDownload
From 2be7b039e926e80488f7f6d84033e016048f9ab7 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 23 Feb 2016 16:04:05 +0800
Subject: [PATCH 4/8] Add the UI for failover slots

Expose failover slots to the user.

Add a new 'failover' argument to pg_create_logical_replication_slot and
pg_create_physical_replication_slot. Accept a new FAILOVER keyword
argument in CREATE_REPLICATION_SLOT on the walsender protocol.
---
 contrib/test_decoding/expected/ddl.out |  3 +++
 contrib/test_decoding/sql/ddl.sql      |  2 ++
 src/backend/catalog/system_views.sql   | 11 ++++++++++-
 src/backend/replication/repl_gram.y    | 13 +++++++++++--
 src/backend/replication/repl_scanner.l |  1 +
 src/backend/replication/slotfuncs.c    |  7 +++++--
 src/backend/replication/walsender.c    |  4 ++--
 src/include/catalog/pg_proc.h          |  4 ++--
 src/include/nodes/replnodes.h          |  1 +
 src/include/replication/slot.h         |  1 +
 10 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out
index 57a1289..5fed500 100644
--- a/contrib/test_decoding/expected/ddl.out
+++ b/contrib/test_decoding/expected/ddl.out
@@ -9,6 +9,9 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
 -- fail because of an already existing slot
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 ERROR:  replication slot "regression_slot" already exists
+-- fail because a failover slot can't replace a normal slot on the master
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding', true);
+ERROR:  replication slot "regression_slot" already exists
 -- fail because of an invalid name
 SELECT 'init' FROM pg_create_logical_replication_slot('Invalid Name', 'test_decoding');
 ERROR:  replication slot name "Invalid Name" contains invalid character
diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql
index e311c59..dc61ef4 100644
--- a/contrib/test_decoding/sql/ddl.sql
+++ b/contrib/test_decoding/sql/ddl.sql
@@ -4,6 +4,8 @@ SET synchronous_commit = on;
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 -- fail because of an already existing slot
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+-- fail because a failover slot can't replace a normal slot on the master
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding', true);
 -- fail because of an invalid name
 SELECT 'init' FROM pg_create_logical_replication_slot('Invalid Name', 'test_decoding');
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index abf9a70..fcb877d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -949,12 +949,21 @@ AS 'pg_logical_slot_peek_binary_changes';
 
 CREATE OR REPLACE FUNCTION pg_create_physical_replication_slot(
     IN slot_name name, IN immediately_reserve boolean DEFAULT false,
-    OUT slot_name name, OUT xlog_position pg_lsn)
+    IN failover boolean DEFAULT false, OUT slot_name name,
+    OUT xlog_position pg_lsn)
 RETURNS RECORD
 LANGUAGE INTERNAL
 STRICT VOLATILE
 AS 'pg_create_physical_replication_slot';
 
+CREATE OR REPLACE FUNCTION pg_create_logical_replication_slot(
+    IN slot_name name, IN plugin name, IN failover boolean DEFAULT false,
+    OUT slot_name text, OUT xlog_position pg_lsn)
+RETURNS RECORD
+LANGUAGE INTERNAL
+STRICT VOLATILE
+AS 'pg_create_logical_replication_slot';
+
 CREATE OR REPLACE FUNCTION
   make_interval(years int4 DEFAULT 0, months int4 DEFAULT 0, weeks int4 DEFAULT 0,
                 days int4 DEFAULT 0, hours int4 DEFAULT 0, mins int4 DEFAULT 0,
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index d93db88..1574f24 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -77,6 +77,7 @@ Node *replication_parse_result;
 %token K_LOGICAL
 %token K_SLOT
 %token K_RESERVE_WAL
+%token K_FAILOVER
 
 %type <node>	command
 %type <node>	base_backup start_replication start_logical_replication
@@ -90,6 +91,7 @@ Node *replication_parse_result;
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot
 %type <boolval>	opt_reserve_wal
+%type <boolval> opt_failover
 
 %%
 
@@ -184,23 +186,25 @@ base_backup_opt:
 
 create_replication_slot:
 			/* CREATE_REPLICATION_SLOT slot PHYSICAL RESERVE_WAL */
-			K_CREATE_REPLICATION_SLOT IDENT K_PHYSICAL opt_reserve_wal
+			K_CREATE_REPLICATION_SLOT IDENT K_PHYSICAL opt_reserve_wal opt_failover
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_PHYSICAL;
 					cmd->slotname = $2;
 					cmd->reserve_wal = $4;
+					cmd->failover = $5;
 					$$ = (Node *) cmd;
 				}
 			/* CREATE_REPLICATION_SLOT slot LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT K_LOGICAL IDENT
+			| K_CREATE_REPLICATION_SLOT IDENT K_LOGICAL IDENT opt_failover
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->plugin = $4;
+					cmd->failover = $5;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -276,6 +280,11 @@ opt_reserve_wal:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_failover:
+			K_FAILOVER						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index f83ec53..a1d9f10 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -98,6 +98,7 @@ PHYSICAL			{ return K_PHYSICAL; }
 RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
+FAILOVER			{ return K_FAILOVER; }
 
 ","				{ return ','; }
 ";"				{ return ';'; }
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index f430714..a2dfc40 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -18,6 +18,7 @@
 
 #include "access/htup_details.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "replication/logical.h"
 #include "replication/logicalfuncs.h"
 #include "utils/builtins.h"
@@ -41,6 +42,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 {
 	Name		name = PG_GETARG_NAME(0);
 	bool 		immediately_reserve = PG_GETARG_BOOL(1);
+	bool		failover = PG_GETARG_BOOL(2);
 	Datum		values[2];
 	bool		nulls[2];
 	TupleDesc	tupdesc;
@@ -57,7 +59,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 	CheckSlotRequirements();
 
 	/* acquire replication slot, this will check for conflicting names */
-	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT, false);
+	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT, failover);
 
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	nulls[0] = false;
@@ -96,6 +98,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 {
 	Name		name = PG_GETARG_NAME(0);
 	Name		plugin = PG_GETARG_NAME(1);
+	bool		failover = PG_GETARG_BOOL(2);
 
 	LogicalDecodingContext *ctx = NULL;
 
@@ -120,7 +123,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	 * errors during initialization because it'll get dropped if this
 	 * transaction fails. We'll make it persistent at the end.
 	 */
-	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL, false);
+	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL, failover);
 
 	/*
 	 * Create logical decoding context, to build the initial snapshot.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1583862..efdbfd1 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -792,7 +792,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	if (cmd->kind == REPLICATION_KIND_PHYSICAL)
 	{
-		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT, false);
+		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT, cmd->failover);
 	}
 	else
 	{
@@ -803,7 +803,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * handle errors during initialization because it'll get dropped if
 		 * this transaction fails. We'll make it persistent at the end.
 		 */
-		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL, false);
+		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL, cmd->failover);
 	}
 
 	initStringInfo(&output_message);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 62b9125..af2b214 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5068,13 +5068,13 @@ DATA(insert OID = 3473 (  spg_range_quad_leaf_consistent	PGNSP PGUID 12 1 0 0 0
 DESCR("SP-GiST support for quad tree over range");
 
 /* replication slots */
-DATA(insert OID = 3779 (  pg_create_physical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 2 0 2249 "19 16" "{19,16,19,3220}" "{i,i,o,o}" "{slot_name,immediately_reserve,slot_name,xlog_position}" _null_ _null_ pg_create_physical_replication_slot _null_ _null_ _null_ ));
+DATA(insert OID = 3779 (  pg_create_physical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 16 16" "{19,16,16,19,3220}" "{i,i,i,o,o}" "{slot_name,immediately_reserve,failover,slot_name,xlog_position}" _null_ _null_ pg_create_physical_replication_slot _null_ _null_ _null_ ));
 DESCR("create a physical replication slot");
 DATA(insert OID = 3780 (  pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 0 2278 "19" _null_ _null_ _null_ _null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ ));
 DESCR("drop a replication slot");
 DATA(insert OID = 3781 (  pg_get_replication_slots	PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,23,28,28,3220,3220}" "{o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null_ _null_ pg_get_replication_slots _null_ _null_ _null_ ));
 DESCR("information about replication slots currently in use");
-DATA(insert OID = 3786 (  pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 2 0 2249 "19 19" "{19,19,25,3220}" "{i,i,o,o}" "{slot_name,plugin,slot_name,xlog_position}" _null_ _null_ pg_create_logical_replication_slot _null_ _null_ _null_ ));
+DATA(insert OID = 3786 (  pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 19 16" "{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,failover,slot_name,xlog_position}" _null_ _null_ pg_create_logical_replication_slot _null_ _null_ _null_ ));
 DESCR("set up a logical replication slot");
 DATA(insert OID = 3782 (  pg_logical_slot_get_changes PGNSP PGUID 12 1000 1000 25 0 f f f f f t v u 4 0 2249 "19 3220 23 1009" "{19,3220,23,1009,3220,28,25}" "{i,i,i,v,o,o,o}" "{slot_name,upto_lsn,upto_nchanges,options,location,xid,data}" _null_ _null_ pg_logical_slot_get_changes _null_ _null_ _null_ ));
 DESCR("get changes from replication slot");
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index d2f1edb..a8fa9d5 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		reserve_wal;
+	bool		failover;
 } CreateReplicationSlotCmd;
 
 
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index cdcbd37..9e23a29 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -4,6 +4,7 @@
  *
  * Copyright (c) 2012-2016, PostgreSQL Global Development Group
  *
+ * src/include/replication/slot.h
  *-------------------------------------------------------------------------
  */
 #ifndef SLOT_H
-- 
2.1.0

0005-Document-failover-slots.patchtext/x-patch; charset=US-ASCII; name=0005-Document-failover-slots.patchDownload
From d5e4a369062d642d718ebef57c6935f849eb1121 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 23 Feb 2016 15:31:13 +0800
Subject: [PATCH 5/8] Document failover slots

---
 doc/src/sgml/func.sgml              | 15 +++++++++-----
 doc/src/sgml/high-availability.sgml | 41 +++++++++++++++++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml   |  2 +-
 doc/src/sgml/protocol.sgml          | 19 ++++++++++++++++-
 4 files changed, 70 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index c0b94bc..649a0c2 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17449,7 +17449,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         <indexterm>
          <primary>pg_create_physical_replication_slot</primary>
         </indexterm>
-        <literal><function>pg_create_physical_replication_slot(<parameter>slot_name</parameter> <type>name</type> <optional>, <parameter>immediately_reserve</> <type>boolean</> </optional>)</function></literal>
+        <literal><function>pg_create_physical_replication_slot(<parameter>slot_name</parameter> <type>name</type>, <optional><parameter>immediately_reserve</> <type>boolean</></optional>, <optional><parameter>failover</> <type>boolean</></optional>)</function></literal>
        </entry>
        <entry>
         (<parameter>slot_name</parameter> <type>name</type>, <parameter>xlog_position</parameter> <type>pg_lsn</type>)
@@ -17460,7 +17460,10 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         when <literal>true</>, specifies that the <acronym>LSN</> for this
         replication slot be reserved immediately; otherwise
         the <acronym>LSN</> is reserved on first connection from a streaming
-        replication client. Streaming changes from a physical slot is only
+        replication client. If <literal>failover</literal> is <literal>true</literal>
+        then the slot is created as a failover slot; see <xref
+        linkend="streaming-replication-slots-failover">.
+        Streaming changes from a physical slot is only
         possible with the streaming-replication protocol &mdash;
         see <xref linkend="protocol-replication">. This function corresponds
         to the replication protocol command <literal>CREATE_REPLICATION_SLOT
@@ -17489,7 +17492,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         <indexterm>
          <primary>pg_create_logical_replication_slot</primary>
         </indexterm>
-        <literal><function>pg_create_logical_replication_slot(<parameter>slot_name</parameter> <type>name</type>, <parameter>plugin</parameter> <type>name</type>)</function></literal>
+        <literal><function>pg_create_logical_replication_slot(<parameter>slot_name</parameter> <type>name</type>, <parameter>plugin</parameter> <type>name</type>, <optional><parameter>failover</> <type>boolean</></optional>)</function></literal>
        </entry>
        <entry>
         (<parameter>slot_name</parameter> <type>name</type>, <parameter>xlog_position</parameter> <type>pg_lsn</type>)
@@ -17497,8 +17500,10 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
        <entry>
         Creates a new logical (decoding) replication slot named
         <parameter>slot_name</parameter> using the output plugin
-        <parameter>plugin</parameter>.  A call to this function has the same
-        effect as the replication protocol command
+        <parameter>plugin</parameter>. If <literal>failover</literal>
+        is <literal>true</literal> the slot is created as a failover
+        slot; see <xref linkend="streaming-replication-slots-failover">. A call to
+        this function has the same effect as the replication protocol command
         <literal>CREATE_REPLICATION_SLOT ... LOGICAL</literal>.
        </entry>
       </row>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 6cb690c..4b75175 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -949,6 +949,47 @@ primary_slot_name = 'node_a_slot'
 </programlisting>
     </para>
    </sect3>
+
+   <sect3 id="streaming-replication-slots-failover" xreflabel="Failover slots">
+     <title>Failover slots</title>
+
+     <para>
+      Normally a replication slot is not preserved across backup and restore
+      (such as by <application>pg_basebackup</application>) and is not
+      replicated to standbys. Slots are <emphasis>automatically
+      dropped</emphasis> when starting up as a streaming replica or in archive
+      recovery (PITR) mode.
+     </para>
+
+     <para>
+      To make it possible for an application to consistently follow
+      failover when a replica is promoted to a new master a slot may be
+      created as a <emphasis>failover slot</emphasis>. A failover slot may
+      only be created, replayed from, or dropped on a master server. Changes to
+      the slot are written to WAL and replicated to standbys. When a standby
+      is promoted applications may connect to the slot on the standby and
+      resume replay from it at a consistent point, as if it were the original
+      master. Failover slots may not be used to replay from a standby before
+      promotion.
+     </para>
+
+     <para>
+      Non-failover slots may be created on and used from a replica. This is
+      currently limited to physical slots as logical decoding is not supported
+      on replica server.
+     </para>
+
+     <para>
+      When a failover slot created on the master has the same name as a
+      non-failover slot on a replica server, the non-failover slot will be
+      automatically dropped. Any client currently connected will be
+      disconnected with an error indicating a conflict with recovery. It
+      is strongly recommended that you avoid creating failover slots with
+      the same name as slots on replicas.
+     </para>
+
+   </sect3>
+
   </sect2>
 
   <sect2 id="cascading-replication">
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index e841348..c7b43ed 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -280,7 +280,7 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
     The commands
     <itemizedlist>
      <listitem>
-      <para><literal>CREATE_REPLICATION_SLOT <replaceable>slot_name</replaceable> LOGICAL <replaceable>output_plugin</replaceable></literal></para>
+      <para><literal>CREATE_REPLICATION_SLOT <replaceable>slot_name</replaceable> LOGICAL <replaceable>output_plugin</replaceable> <optional>FAILOVER</optional></literal></para>
      </listitem>
 
      <listitem>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 522128e..33b6830 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1434,7 +1434,7 @@ The commands accepted in walsender mode are:
   </varlistentry>
 
   <varlistentry>
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> { <literal>PHYSICAL</> [ <literal>RESERVE_WAL</> ] | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> { <literal>PHYSICAL</> <optional><literal>RESERVE_WAL</></> | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> } <optional><literal>FAILOVER</></>
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1474,6 +1474,17 @@ The commands accepted in walsender mode are:
         </para>
        </listitem>
       </varlistentry>
+
+      <varlistentry>
+       <term><literal>FAILOVER</></term>
+       <listitem>
+        <para>
+         Create this slot as a <link linkend="streaming-replication-slots-failover">
+         failover slot</link>.
+        </para>
+       </listitem>
+      </varlistentry>
+
      </variablelist>
     </listitem>
   </varlistentry>
@@ -1829,6 +1840,12 @@ The commands accepted in walsender mode are:
       to process the output for streaming.
      </para>
 
+     <para>
+      Logical replication automatically follows timeline switches. It is
+      not necessary or possible to supply a <literal>TIMELINE</literal>
+      option as in physical replication.
+     </para>
+
      <variablelist>
       <varlistentry>
        <term><literal>SLOT</literal> <replaceable class="parameter">slot_name</></term>
-- 
2.1.0
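
For anyone skimming the thread, the interface documented in 0005 boils down to an extra boolean argument on the existing slot-creation functions. A sketch of the documented calls follows (function names and argument order as in the patch; this assumes a server built with the patch set applied, so it is illustrative only):

```sql
-- Create a logical failover slot on the master. The third argument
-- ('failover') is the one this patch set introduces.
SELECT * FROM pg_create_logical_replication_slot('decoding_slot', 'test_decoding', true);

-- Physical failover slot: the second argument reserves WAL immediately,
-- the third marks the slot as a failover slot.
SELECT * FROM pg_create_physical_replication_slot('phys_slot', true, true);
```

The walsender-protocol equivalent per the protocol.sgml hunk would be `CREATE_REPLICATION_SLOT decoding_slot LOGICAL test_decoding FAILOVER`.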

0006-Add-failover-to-pg_replication_slots.patch (text/x-patch; charset=US-ASCII)
From 195f331ff3e8968818538b6f892de55e070409fd Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 23 Feb 2016 15:55:01 +0800
Subject: [PATCH 6/8] Add 'failover' to pg_replication_slots

---
 contrib/test_decoding/expected/ddl.out | 38 ++++++++++++++++++++++++++++------
 contrib/test_decoding/sql/ddl.sql      | 15 ++++++++++++--
 doc/src/sgml/catalogs.sgml             | 10 +++++++++
 src/backend/catalog/system_views.sql   |  1 +
 src/backend/replication/slotfuncs.c    |  6 +++++-
 src/include/catalog/pg_proc.h          |  2 +-
 src/test/regress/expected/rules.out    |  3 ++-
 7 files changed, 64 insertions(+), 11 deletions(-)

diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out
index 5fed500..5b2f34a 100644
--- a/contrib/test_decoding/expected/ddl.out
+++ b/contrib/test_decoding/expected/ddl.out
@@ -61,11 +61,37 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
 SELECT slot_name, plugin, slot_type, active,
     NOT catalog_xmin IS NULL AS catalog_xmin_set,
     xmin IS NULl  AS data_xmin_not_set,
-    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
 FROM pg_replication_slots;
-    slot_name    |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal 
------------------+---------------+-----------+--------+------------------+-------------------+----------
- regression_slot | test_decoding | logical   | f      | t                | t                 | t
+    slot_name    |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal | failover 
+-----------------+---------------+-----------+--------+------------------+-------------------+----------+----------
+ regression_slot | test_decoding | logical   | f      | t                | t                 | t        | f
+(1 row)
+
+/* same for a failover slot */
+SELECT 'init' FROM pg_create_logical_replication_slot('failover_slot', 'test_decoding', true);
+ ?column? 
+----------
+ init
+(1 row)
+
+SELECT slot_name, plugin, slot_type, active,
+    NOT catalog_xmin IS NULL AS catalog_xmin_set,
+    xmin IS NULl  AS data_xmin_not_set,
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
+FROM pg_replication_slots
+WHERE slot_name = 'failover_slot';
+   slot_name   |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal | failover 
+---------------+---------------+-----------+--------+------------------+-------------------+----------+----------
+ failover_slot | test_decoding | logical   | f      | t                | t                 | t        | t
+(1 row)
+
+SELECT pg_drop_replication_slot('failover_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
 (1 row)
 
 /*
@@ -676,7 +702,7 @@ SELECT pg_drop_replication_slot('regression_slot');
 
 /* check that the slot is gone */
 SELECT * FROM pg_replication_slots;
- slot_name | plugin | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
+ slot_name | plugin | slot_type | datoid | database | active | active_pid | failover | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
+-----------+--------+-----------+--------+----------+--------+------------+----------+------+--------------+-------------+---------------------
 (0 rows)
 
diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql
index dc61ef4..f64b21c 100644
--- a/contrib/test_decoding/sql/ddl.sql
+++ b/contrib/test_decoding/sql/ddl.sql
@@ -24,16 +24,27 @@ SELECT 'init' FROM pg_create_physical_replication_slot('repl');
 SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 SELECT pg_drop_replication_slot('repl');
 
-
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 
 /* check whether status function reports us, only reproduceable columns */
 SELECT slot_name, plugin, slot_type, active,
     NOT catalog_xmin IS NULL AS catalog_xmin_set,
     xmin IS NULl  AS data_xmin_not_set,
-    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
 FROM pg_replication_slots;
 
+/* same for a failover slot */
+SELECT 'init' FROM pg_create_logical_replication_slot('failover_slot', 'test_decoding', true);
+SELECT slot_name, plugin, slot_type, active,
+    NOT catalog_xmin IS NULL AS catalog_xmin_set,
+    xmin IS NULl  AS data_xmin_not_set,
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
+FROM pg_replication_slots
+WHERE slot_name = 'failover_slot';
+SELECT pg_drop_replication_slot('failover_slot');
+
 /*
  * Check that changes are handled correctly when interleaved with ddl
  */
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 951f59b..0a3af1f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -5377,6 +5377,16 @@
      </row>
 
      <row>
+      <entry><structfield>failover</structfield></entry>
+      <entry><type>boolean</type></entry>
+      <entry></entry>
+      <entry>
+       True if this slot is a failover slot; see
+       <xref linkend="streaming-replication-slots-failover">.
+      </entry>
+     </row>
+
+     <row>
       <entry><structfield>xmin</structfield></entry>
       <entry><type>xid</type></entry>
       <entry></entry>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fcb877d..26c02e4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -704,6 +704,7 @@ CREATE VIEW pg_replication_slots AS
             D.datname AS database,
             L.active,
             L.active_pid,
+            L.failover,
             L.xmin,
             L.catalog_xmin,
             L.restart_lsn,
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index a2dfc40..abc450d 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -177,7 +177,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS)
 Datum
 pg_get_replication_slots(PG_FUNCTION_ARGS)
 {
-#define PG_GET_REPLICATION_SLOTS_COLS 10
+#define PG_GET_REPLICATION_SLOTS_COLS 11
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -227,6 +227,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		XLogRecPtr	restart_lsn;
 		XLogRecPtr	confirmed_flush_lsn;
 		pid_t		active_pid;
+		bool		failover;
 		Oid			database;
 		NameData	slot_name;
 		NameData	plugin;
@@ -249,6 +250,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 			namecpy(&plugin, &slot->data.plugin);
 
 			active_pid = slot->active_pid;
+			failover = slot->data.failover;
 		}
 		SpinLockRelease(&slot->mutex);
 
@@ -279,6 +281,8 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		else
 			nulls[i++] = true;
 
+		values[i++] = BoolGetDatum(failover);
+
 		if (xmin != InvalidTransactionId)
 			values[i++] = TransactionIdGetDatum(xmin);
 		else
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index af2b214..1d175fc 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5072,7 +5072,7 @@ DATA(insert OID = 3779 (  pg_create_physical_replication_slot PGNSP PGUID 12 1 0
 DESCR("create a physical replication slot");
 DATA(insert OID = 3780 (  pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 0 2278 "19" _null_ _null_ _null_ _null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ ));
 DESCR("drop a replication slot");
-DATA(insert OID = 3781 (  pg_get_replication_slots	PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,23,28,28,3220,3220}" "{o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null_ _null_ pg_get_replication_slots _null_ _null_ _null_ ));
+DATA(insert OID = 3781 (  pg_get_replication_slots	PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,23,16,28,28,3220,3220}" "{o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,active,active_pid,failover,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null_ _null_ pg_get_replication_slots _null_ _null_ _null_ ));
 DESCR("information about replication slots currently in use");
 DATA(insert OID = 3786 (  pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 19 16" "{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,failover,slot_name,xlog_position}" _null_ _null_ pg_create_logical_replication_slot _null_ _null_ _null_ ));
 DESCR("set up a logical replication slot");
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 81bc5c9..d8315c6 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1417,11 +1417,12 @@ pg_replication_slots| SELECT l.slot_name,
     d.datname AS database,
     l.active,
     l.active_pid,
+    l.failover,
     l.xmin,
     l.catalog_xmin,
     l.restart_lsn,
     l.confirmed_flush_lsn
-   FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, active, active_pid, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn)
+   FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, active, active_pid, failover, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn)
      LEFT JOIN pg_database d ON ((l.datoid = d.oid)));
 pg_roles| SELECT pg_authid.rolname,
     pg_authid.rolsuper,
-- 
2.1.0
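
A hedged sketch of how the new view column from 0006 might be used when verifying a failover setup (only the `failover` column name comes from the patch; the rest is an illustrative query against a patched server, not part of the patch set):

```sql
-- On a standby, list which slots will survive promotion.
-- Failover slots are replicated from the master and report failover = true;
-- non-failover slots created locally on the replica report failover = false.
SELECT slot_name, slot_type, active, failover
FROM pg_replication_slots
ORDER BY slot_name;
```

Since failover slots on a standby are not usable until promotion, this is mainly a monitoring aid for checking that the expected slots have reached the replica.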

0007-not-for-inclusion-Test-script-for-failover-slots.patch (text/x-patch; charset=US-ASCII)
From fa2cacf7b747e9075dd333646e502068a6280776 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 20 Jan 2016 18:41:37 +0800
Subject: [PATCH 7/8] (not for inclusion): Test script for failover slots

---
 failover-slot-test.sh | 264 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 264 insertions(+)
 create mode 100755 failover-slot-test.sh

diff --git a/failover-slot-test.sh b/failover-slot-test.sh
new file mode 100755
index 0000000..1d0afb5
--- /dev/null
+++ b/failover-slot-test.sh
@@ -0,0 +1,264 @@
+#!/bin/bash
+
+set -e -u
+
+ulimit -c unlimited
+
+# dump shmem in cores
+echo 15 > /proc/self/coredump_filter
+
+DATADIR=96slottest
+PGPORT_MASTER=5144
+PGPORT_REPLICA=$(( $PGPORT_MASTER + 1 ))
+export PGUSER=postgres
+export PGDATABASE=postgres
+
+export PATH=$HOME/pg/96/bin:$PATH
+
+if [ -e $DATADIR ]; then
+    pg_ctl -D $DATADIR -w stop -m immediate || true
+    rm -rf $DATADIR ${DATADIR}.log
+fi
+
+if [ -e ${DATADIR}-replica ]; then
+    pg_ctl -D ${DATADIR}-replica -w stop -m immediate || true
+    rm -rf ${DATADIR}-replica ${DATADIR}-replica.log
+fi
+
+rm -rf xlogs
+
+postmaster_opts='-c max_replication_slots=12 -c wal_level=logical -c max_wal_senders=10 -c track_commit_timestamp=on -c wal_keep_segments=100 -c log_min_messages=debug2 -c log_error_verbosity=verbose'
+
+echo "Initdb'ing master"
+initdb -D $DATADIR -A trust -N -U postgres > ${DATADIR}-initdb.log
+
+cat > $DATADIR/pg_hba.conf <<'__END__'
+# TYPE  DATABASE        USER            ADDRESS                 METHOD
+local   all             all                                     trust
+host    all             all             127.0.0.1/32            trust
+host    all             all             ::1/128                 trust
+local   replication     postgres                                trust
+host    replication     postgres        127.0.0.1/32            trust
+host    replication     postgres        ::1/128                 trust
+__END__
+
+echo "Starting master"
+PGPORT=$PGPORT_MASTER pg_ctl -l ${DATADIR}.log -D $DATADIR -w start -o "$postmaster_opts"
+
+# A function to wait for replica catchup
+PGPORT=$PGPORT_MASTER psql <<'__END__'
+CREATE OR REPLACE FUNCTION public.pg_xlog_wait_remote_apply(i_pos pg_lsn, i_pid integer) RETURNS VOID
+AS $FUNC$
+BEGIN
+    WHILE EXISTS(SELECT true FROM pg_stat_get_wal_senders() s WHERE s.flush_location < i_pos AND (i_pid = 0 OR s.pid = i_pid)) LOOP
+                PERFORM pg_sleep(0.01);
+        END LOOP;
+END;$FUNC$
+LANGUAGE plpgsql;
+__END__
+
+wait_for_catchup()
+{
+  PGPORT=$PGPORT_MASTER psql -c "SELECT pg_xlog_wait_remote_apply(pg_current_xlog_location(), 0);"
+}
+
+print_slots_replica()
+{
+  echo "Replica slots:"
+  PGPORT=$PGPORT_REPLICA psql -c "SELECT * FROM pg_replication_slots"
+}
+
+print_slots_master()
+{
+  echo "Master slots:"
+  PGPORT=$PGPORT_MASTER psql -c "SELECT * FROM pg_replication_slots"
+}
+
+print_slots()
+{
+    print_slots_master
+    print_slots_replica
+}
+
+PGPID=$(pg_ctl -D ${DATADIR} status | awk '/(PID:.*)/ { found = match($0, "\\(PID:\\s([0-9]*)\\)", arr); if (found) { print arr[1]; } }')
+echo "Master postmaster PID is ${PGPID}"
+
+echo "Creating before_basebackup slots"
+PGPORT=$PGPORT_MASTER psql -c "SELECT * FROM pg_create_logical_replication_slot('before_basebackup', 'test_decoding', true);"
+PGPORT=$PGPORT_MASTER psql -c "SELECT * FROM pg_create_logical_replication_slot('before_basebackup_nf', 'test_decoding');"
+PGPORT=$PGPORT_MASTER psql -c "SELECT * FROM pg_create_physical_replication_slot('before_basebackup_ph', true, true);"
+PGPORT=$PGPORT_MASTER psql -c "SELECT * FROM pg_create_physical_replication_slot('before_basebackup_ph_nf');"
+PGPORT=$PGPORT_MASTER psql -c "SELECT * FROM pg_create_logical_replication_slot('drop_test', 'test_decoding', true);"
+
+print_slots_master
+
+# Crash the master and restart it. The main point here is to ensure we don't
+# delete non-failover slots during normal crash recovery.
+PGPORT=$PGPORT_MASTER pg_ctl -D $DATADIR -w stop -m immediate
+PGPORT=$PGPORT_MASTER pg_ctl -l ${DATADIR}.log -D $DATADIR -w start -o "$postmaster_opts"
+
+sleep 1
+print_slots_master
+
+echo "Making basebackup"
+PGPORT=$PGPORT_MASTER pg_basebackup -D ${DATADIR}-replica -X stream -R
+
+echo "replica xlogs:"
+ls ${DATADIR}-replica/pg_xlog
+
+echo "Starting the replica"
+PGPORT=$PGPORT_REPLICA pg_ctl -l ${DATADIR}-replica.log -D ${DATADIR}-replica -w start -o "$postmaster_opts -c hot_standby=on"
+
+PGPID=$(pg_ctl -D "${DATADIR}-replica" status | awk '/(PID:.*)/ { found = match($0, "\\(PID:\\s([0-9]*)\\)", arr); if (found) { print arr[1]; } }')
+echo "Replica postmaster PID is ${PGPID}"
+
+#echo "---- attach now -----"
+#sleep 20
+#echo "---------------------"
+
+PGPORT=$PGPORT_MASTER psql -c "SELECT * FROM pg_create_logical_replication_slot('after_basebackup', 'test_decoding', true);"
+PGPORT=$PGPORT_MASTER psql -c "SELECT * FROM pg_create_logical_replication_slot('after_basebackup_nf', 'test_decoding');"
+PGPORT=$PGPORT_MASTER psql -c "SELECT * FROM pg_create_physical_replication_slot('after_basebackup_ph', true, true);"
+PGPORT=$PGPORT_MASTER psql -c "SELECT * FROM pg_create_physical_replication_slot('after_basebackup_ph_nf');"
+
+echo "Slots after creation of before and after slots:"
+print_slots
+
+# expect this to fail
+set +e
+echo "Attempting creation of non-failover logical slot on replica (WILLFAIL)"
+PGPORT=$PGPORT_REPLICA psql -c "SELECT * FROM pg_create_logical_replication_slot('on_replica', 'test_decoding');"
+echo "Attempting creation of failover logical slot on replica (WILLFAIL)"
+PGPORT=$PGPORT_REPLICA psql -c "SELECT * FROM pg_create_logical_replication_slot('on_replica', 'test_decoding', true);"
+echo "Attempting creation of failover physical slot on replica (WILLFAIL)"
+PGPORT=$PGPORT_REPLICA psql -c "SELECT * FROM pg_create_physical_replication_slot('on_replica', true, true);"
+set -e
+
+# this must succeed. Using pg_receivexlog just for a different interface; it'd
+# be fine to use SQL instead, the two are equivalent.
+echo "Creating non-failover physical slot on replica"
+PGPORT=$PGPORT_REPLICA pg_receivexlog -S "phys_conflict_test" --create-slot
+
+wait_for_catchup
+
+mkdir -p xlogs
+
+echo "Attempt replay from physical replica failover slot (WILLFAIL)"
+set +e
+PGPORT=$PGPORT_REPLICA pg_receivexlog -S "after_basebackup_ph" -D xlogs --no-loop
+set -e
+
+echo "Start replay from physical replica non-failover slot"
+PGPORT=$PGPORT_REPLICA pg_receivexlog -S "phys_conflict_test" -D xlogs --no-loop &
+
+print_slots_replica
+
+# Now make a failover slot on the master with the same name. Rather than causing fireworks
+# this should terminate our pg_receivexlog and clobber the slot.
+PGPORT=$PGPORT_MASTER psql -c "SELECT * FROM pg_create_physical_replication_slot('phys_conflict_test', true, true);"
+
+# pg_receivexlog should terminate
+wait
+
+# drop the slot from the replica
+PGPORT=$PGPORT_MASTER pg_receivexlog -S "phys_conflict_test" --drop-slot
+
+# Test dropping of a failover slot on the master
+PGPORT=$PGPORT_MASTER pg_receivexlog -S "drop_test" --drop-slot
+
+wait_for_catchup
+echo "Slots:"
+print_slots
+
+# do some writes
+PGPORT=$PGPORT_MASTER psql -c "CREATE TABLE test_tab(blah text, msg text); INSERT INTO test_tab(blah, msg) SELECT x::text, 'onmaster-beforeread' FROM generate_series(0,1) x;"
+
+wait_for_catchup
+echo "Slots:"
+print_slots
+
+# Attempt to read from replica's slots. This must fail, but not with an
+# error indicating that the slots are missing.
+set +e
+echo "REPLICA: Trying to replay logical slot before_basebackup from replica (WILLFAIL)"
+PGPORT=$PGPORT_REPLICA psql -c "SELECT * FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL);"
+echo "REPLICA: Trying to replay logical slot after_basebackup from replica (WILLFAIL)"
+PGPORT=$PGPORT_REPLICA psql -c "SELECT * FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL);"
+set -e
+
+# Read from the master. These must succeed.
+echo "MASTER: Trying to replay logical slot before_basebackup from master (expect rows 0,1)"
+PGPORT=$PGPORT_MASTER psql -c "SELECT * FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL);"
+echo "MASTER: Trying to replay logical slot after_basebackup from master (expect rows 0,1)"
+PGPORT=$PGPORT_MASTER psql -c "SELECT * FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL);"
+
+# do some more writes
+#
+# Because we replayed the slot up to the first rows the failover slot replay should only see these
+# rows not the first ones.
+#
+PGPORT=$PGPORT_MASTER psql -c "INSERT INTO test_tab(blah, msg) SELECT x::text, 'onmaster-afterread' FROM generate_series(2,3) x;"
+
+wait_for_catchup
+print_slots
+
+# kill the master and promote the replica
+echo "Killing master"
+PGPORT=$PGPORT_MASTER pg_ctl -D $DATADIR -w stop -m fast
+echo "Promoting replica"
+PGPORT=$PGPORT_REPLICA pg_ctl -D ${DATADIR}-replica -w promote
+
+sleep 1
+
+if ! PGPORT=$PGPORT_REPLICA pg_isready -t 10 ; then
+    echo "Unable to connect to promoted replica"
+    exit 1
+fi
+
+while test "$(PGPORT=$PGPORT_REPLICA psql -qAtc 'SELECT pg_is_in_recovery()')" = 't'
+do
+    echo "Waiting for promotion to finish..."
+    sleep 1
+done
+
+sleep 1
+echo "Replica up"
+
+print_slots_replica
+
+# do some more writes on the promoted replica / new master
+# These should be visible to both slots, but possibly only after a second call
+# (since we don't follow the timeline switch within one call)
+PGPORT=$PGPORT_REPLICA psql -c "INSERT INTO test_tab(blah, msg) SELECT x::text, 'onreplica' FROM generate_series(4,5) x;"
+
+echo "Final LSN is $(PGPORT=$PGPORT_REPLICA psql -qAt -c "SELECT pg_current_xlog_insert_location();")"
+
+# read from the slots again. They should still exist and be readable.
+echo "REPLICA: Slot changes for after_basebackup from promoted replica (expect rows 2+)"
+PGPORT=$PGPORT_REPLICA psql -c "SELECT * FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL);"
+PGPORT=$PGPORT_REPLICA psql -c "SELECT * FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL);"
+echo "REPLICA: Slot changes for before_basebackup from promoted replica (expect rows 2+)"
+PGPORT=$PGPORT_REPLICA psql -c "SELECT * FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL);"
+PGPORT=$PGPORT_REPLICA psql -c "SELECT * FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL);"
+
+print_slots_replica
+
+echo "--- restarting replica ---"
+
+echo "(stopping)"
+PGPORT=$PGPORT_REPLICA pg_ctl -D ${DATADIR}-replica -w stop -m fast
+echo "(starting)"
+PGPORT=$PGPORT_REPLICA pg_ctl -l ${DATADIR}-replica.log -D ${DATADIR}-replica -w start -o "$postmaster_opts"
+
+sleep 1
+print_slots_replica
+
+# Should work and only replay whatever was left, if anything
+set +e
+echo "REPLICA: Slot changes for after_basebackup (expect remaining rows)"
+PGPORT=$PGPORT_REPLICA psql -c "SELECT * FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL);"
+PGPORT=$PGPORT_REPLICA psql -c "SELECT * FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL);"
+echo "REPLICA: Slot changes for before_basebackup (expect remaining rows)"
+PGPORT=$PGPORT_REPLICA psql -c "SELECT * FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL);"
+PGPORT=$PGPORT_REPLICA psql -c "SELECT * FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL);"
+set -e
-- 
2.1.0

0001-Allow-logical-slots-to-follow-timeline-switches.patch (text/x-patch; charset=US-ASCII)
From 4a0c0b8fbdb586b9203a28edeeb9ebfd6062d4d8 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 11 Feb 2016 10:44:14 +0800
Subject: [PATCH 1/8] Allow logical slots to follow timeline switches

Make logical replication slots timeline-aware, so replay can
continue from a historical timeline onto the server's current
timeline.

This is required to make failover slots possible and may also
be used by extensions that CreateReplicationSlot on a standby
and replay from that slot once the replica is promoted.

This does NOT add support for replaying from a logical slot on
a standby or for syncing slots to replicas.
---
 src/backend/access/transam/xlogreader.c        |  43 ++++-
 src/backend/access/transam/xlogutils.c         | 214 +++++++++++++++++++++++--
 src/backend/replication/logical/logicalfuncs.c |  38 ++++-
 src/include/access/xlogreader.h                |  33 +++-
 src/include/access/xlogutils.h                 |   2 +
 5 files changed, 295 insertions(+), 35 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fcb0872..5899f44 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -10,6 +10,9 @@
  *
  * NOTES
  *		See xlogreader.h for more notes on this facility.
+ *
+ * 		The xlogreader is compiled as both front-end and backend code so
+ * 		it may not use elog, server-defined static variables, etc.
  *-------------------------------------------------------------------------
  */
 
@@ -116,6 +119,9 @@ XLogReaderAllocate(XLogPageReadCB pagereadfunc, void *private_data)
 		return NULL;
 	}
 
+	/* Will be loaded on first read */
+	state->timelineHistory = NULL;
+
 	return state;
 }
 
@@ -135,6 +141,13 @@ XLogReaderFree(XLogReaderState *state)
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
+#ifdef FRONTEND
+	/* FE code doesn't use this and we can't list_free_deep on FE */
+	Assert(state->timelineHistory == NULL);
+#else
+	if (state->timelineHistory)
+		list_free_deep(state->timelineHistory);
+#endif
 	pfree(state->readBuf);
 	pfree(state);
 }
@@ -208,9 +221,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 
 	if (RecPtr == InvalidXLogRecPtr)
 	{
+		/* No explicit start point, read the record after the one we just read */
 		RecPtr = state->EndRecPtr;
 
 		if (state->ReadRecPtr == InvalidXLogRecPtr)
+			/* allow readPageTLI to go backward */
 			randAccess = true;
 
 		/*
@@ -223,6 +238,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	else
 	{
 		/*
+		 * Caller supplied a position to start at.
+		 *
 		 * In this case, the passed-in record pointer should already be
 		 * pointing to a valid record starting position.
 		 */
@@ -309,8 +326,9 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		/* XXX: more validation should be done here */
 		if (total_len < SizeOfXLogRecord)
 		{
-			report_invalid_record(state, "invalid record length at %X/%X",
-								  (uint32) (RecPtr >> 32), (uint32) RecPtr);
			report_invalid_record(state, "invalid record length at %X/%X: wanted %u, got %u",
								  (uint32) (RecPtr >> 32), (uint32) RecPtr,
								  (uint32) SizeOfXLogRecord, total_len);
 			goto err;
 		}
 		gotheader = false;
@@ -466,9 +484,7 @@ err:
 	 * Invalidate the xlog page we've cached. We might read from a different
 	 * source after failure.
 	 */
-	state->readSegNo = 0;
-	state->readOff = 0;
-	state->readLen = 0;
+	XLogReaderInvalCache(state);
 
 	if (state->errormsg_buf[0] != '\0')
 		*errormsg = state->errormsg_buf;
@@ -599,9 +615,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 {
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X",
-							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
		report_invalid_record(state, "invalid record length at %X/%X: wanted %u, got %u",
							  (uint32) (RecPtr >> 32), (uint32) RecPtr,
							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
 		return false;
 	}
 	if (record->xl_rmid > RM_MAX_ID)
@@ -1337,3 +1353,14 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 
 	return true;
 }
+
+/*
+ * Invalidate the xlog reader's cached page to force a re-read
+ */
+void
+XLogReaderInvalCache(XLogReaderState *state)
+{
+	state->readSegNo = 0;
+	state->readOff = 0;
+	state->readLen = 0;
+}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 444e218..85bac01 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -7,6 +7,9 @@
  * This file contains support routines that are used by XLOG replay functions.
  * None of this code is used during normal system operation.
  *
+ * Unlike xlogreader.c this is only compiled for the backend so it may use
+ * elog, etc.
+ *
  *
  * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -21,6 +24,7 @@
 
 #include "miscadmin.h"
 
+#include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
@@ -651,6 +655,8 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 	static int	sendFile = -1;
 	static XLogSegNo sendSegNo = 0;
 	static uint32 sendOff = 0;
+	/* So we notice if asked for the same seg on a new tli: */
+	static TimeLineID lastTLI = 0;
 
 	p = buf;
 	recptr = startptr;
@@ -664,11 +670,11 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 
 		startoff = recptr % XLogSegSize;
 
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+		/* Do we need to switch to a new xlog segment? */
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) || lastTLI != tli)
 		{
 			char		path[MAXPGPATH];
 
-			/* Switch to another logfile segment */
 			if (sendFile >= 0)
 				close(sendFile);
 
@@ -692,6 +698,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 									path)));
 			}
 			sendOff = 0;
+			lastTLI = tli;
 		}
 
 		/* Need to seek in the file? */
@@ -759,28 +766,66 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 
 	loc = targetPagePtr + reqLen;
+
+	/* Make sure enough xlog is available... */
 	while (1)
 	{
 		/*
-		 * TODO: we're going to have to do something more intelligent about
-		 * timelines on standbys. Use readTimeLineHistory() and
-		 * tliOfPointInHistory() to get the proper LSN? For now we'll catch
-		 * that case earlier, but the code and TODO is left in here for when
-		 * that changes.
+		 * Check which timeline to get the record from.
+		 *
+		 * We have to do it after each loop because if we're in
+		 * recovery as a cascading standby the current timeline
+		 * might've become historical.
 		 */
-		if (!RecoveryInProgress())
+		XLogReadDetermineTimeline(state);
+
+		if (state->currTLI == ThisTimeLineID)
 		{
-			*pageTLI = ThisTimeLineID;
-			flushptr = GetFlushRecPtr();
+			/*
+			 * We're reading from the current timeline so we might
+			 * have to wait for the desired record to be generated
+			 * (or, for a standby, received & replayed)
+			 */
+			if (!RecoveryInProgress())
+			{
+				*pageTLI = ThisTimeLineID;
+				flushptr = GetFlushRecPtr();
+			}
+			else
+				flushptr = GetXLogReplayRecPtr(pageTLI);
+
+			if (loc <= flushptr)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(1000L);
 		}
 		else
-			flushptr = GetXLogReplayRecPtr(pageTLI);
-
-		if (loc <= flushptr)
+		{
+			/*
+			 * We're on a historical timeline, limit reading to the
+			 * switch point where we moved to the next timeline.
+			 *
+			 * We could just jump to the next timeline early since
+			 * the whole segment the last page is on got copied onto
+			 * the new timeline, but this is simpler.
+			 */
+			flushptr = state->currTLIValidUntil;
+
+			/*
+			 * FIXME: Setting pageTLI to the TLI the *record* we
+			 * want is on can be slightly wrong; the page might
+			 * begin on an older timeline if it contains a timeline
+			 * switch, since its xlog segment will've been copied
+			 * from the prior timeline. We should really read the
+			 * page header. It's pretty harmless though as nothing
+			 * cares so long as the timeline doesn't go backwards.
+			 */
+			*pageTLI = state->currTLI;
+
+			/* No need to wait on a historical timeline */
 			break;
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(1000L);
+		}
 	}
 
 	/* more than one block available */
@@ -793,7 +838,142 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	else
 		count = flushptr - targetPagePtr;
 
-	XLogRead(cur_page, *pageTLI, targetPagePtr, XLOG_BLCKSZ);
+	XLogRead(cur_page, *pageTLI, targetPagePtr, count);
 
 	return count;
 }
+
+/*
+ * Figure out what timeline to look on for the record the xlogreader
+ * is being asked asked to read, in currRecPtr. This may be used
+ * to determine which xlog segment file to open, etc.
+ *
+ * It depends on:
+ *
+ * - Whether we're reading a record immediately following one we read
+ *   before or doing a random read. We can only use the cached
+ *   timeline info if we're reading sequentially.
+ *
+ * - Whether the timeline of the prior record read was historical or
+ *   the current timeline and, if historical, where it's valid up
+ *   to. On a historical timeline we need to avoid reading past the
+ *   timeline switch point. The records after it are probably invalid,
+ *   but worse, they might be valid but *different*.
+ *
+ * - Whether the current timeline became historical since the last record
+ *   we read. We need to make sure we don't read past the switch
+ *   point.
+ *
+ * None of this has any effect unless callbacks use currTLI to
+ * determine which timeline to read from and optionally use the
+ * validity limit to avoid reading past the valid end of a page.
+ *
+ * Note that an xlog segment may contain data from an older timeline
+ * if it was copied during a timeline switch. Callers may NOT assume
+ * that currTLI is the timeline that will be in a given page's
+ * xlp_tli; the page may begin on older timeline.
+ */
+void
+XLogReadDetermineTimeline(XLogReaderState *state)
+{
+	if (state->timelineHistory == NULL)
+		state->timelineHistory = readTimeLineHistory(ThisTimeLineID);
+
+	if (state->currTLIValidUntil == InvalidXLogRecPtr &&
+		state->currTLI != ThisTimeLineID &&
+		state->currTLI != 0)
+	{
+		/*
+		 * We were reading what was the current timeline but it became
+		 * historical. Either we were replaying as a replica and got
+		 * promoted or we're replaying as a cascading replica from a
+		 * parent that got promoted.
+		 *
+		 * Force a re-read of the timeline history.
+		 */
+		list_free_deep(state->timelineHistory);
+		state->timelineHistory = readTimeLineHistory(ThisTimeLineID);
+
+		elog(DEBUG2, "timeline %u became historical during decoding",
+				state->currTLI);
+
+		/* then invalidate the timeline info so we read again */
+		state->currTLI = 0;
+	}
+
+	if (state->currRecPtr == state->EndRecPtr &&
+		state->currTLIValidUntil != InvalidXLogRecPtr &&
+		state->currRecPtr >= state->currTLIValidUntil)
+	{
+		/*
		 * We're reading the immediately following record but we're at
+		 * a timeline boundary and must read the next record from the
+		 * new TLI.
+		 */
+		elog(DEBUG2, "requested record %X/%X is after end of cur TLI %u "
+				"valid until %X/%X, switching to next timeline",
+				(uint32)(state->currRecPtr >> 32),
+				(uint32)state->currRecPtr,
+				state->currTLI,
+				(uint32)(state->currTLIValidUntil >> 32),
+				(uint32)(state->currTLIValidUntil));
+
+		/* Invalidate TLI info so we look it up again */
+		state->currTLI = 0;
+		state->currTLIValidUntil = InvalidXLogRecPtr;
+	}
+
+	if (state->currRecPtr != state->EndRecPtr ||
+		state->currTLI == 0)
+	{
+		/*
+		 * Something changed. We're not reading the record immediately
+		 * after the one we just read, the previous record was at a
+		 * timeline boundary, or we haven't yet determined the timeline
+		 * to read from.
+		 *
+		 * Work out what timeline to read this record from.
+		 */
+		state->currTLI = tliOfPointInHistory(state->currRecPtr,
+				state->timelineHistory);
+
+		if (state->currTLI != ThisTimeLineID)
+		{
+			/*
+			 * It's on a historical timeline.
+			 *
+			 * We'll probably read more records after this so make a
+			 * note of the point at which we have to stop reading and do
+			 * another TLI switch.
+			 *
+			 * Callbacks can also use this to avoid reading past the
+			 * valid end of the TLI.
+			 */
+			state->currTLIValidUntil = tliSwitchPoint(state->currTLI,
+					state->timelineHistory, NULL);
+		}
+		else
+		{
+			/*
+			 * We're on the current timeline. The callback can use the
+			 * xlog flush position and we don't have to worry about
+			 * the TLI ending.
+			 *
+			 * If we're in recovery from another standby (cascading)
+			 * we could receive a new timeline, making the current
+			 * timeline historical. We check that by comparing currTLI
+			 * again at each record read.
+			 */
+			state->currTLIValidUntil = InvalidXLogRecPtr;
+		}
+
+		elog(DEBUG2, "XLog read ptr %X/%X is on tli %u valid until %X/%X, current tli is %u",
+				(uint32)(state->currRecPtr >> 32),
+				(uint32)state->currRecPtr,
+				state->currTLI,
+				(uint32)(state->currTLIValidUntil >> 32),
+				(uint32)(state->currTLIValidUntil),
+				ThisTimeLineID);
+	}
+}
+
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index f789fc1..f29fca3 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -231,12 +231,6 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	rsinfo->setResult = p->tupstore;
 	rsinfo->setDesc = p->tupdesc;
 
-	/* compute the current end-of-wal */
-	if (!RecoveryInProgress())
-		end_of_wal = GetFlushRecPtr();
-	else
-		end_of_wal = GetXLogReplayRecPtr(NULL);
-
 	ReplicationSlotAcquire(NameStr(*name));
 
 	PG_TRY();
@@ -263,6 +257,14 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 
 		ctx->output_writer_private = p;
 
+		/*
+		 * We start reading xlog from the restart lsn, even though in
+		 * CreateDecodingContext we set the snapshot builder up using the
+		 * slot's candidate_restart_lsn. This means we might read xlog we don't
+		 * actually decode rows from, but the snapshot builder might need it to
+		 * get to a consistent point. The point we start returning data to
+		 * *users* at is the candidate restart lsn from the decoding context.
+		 */
 		startptr = MyReplicationSlot->data.restart_lsn;
 
 		CurrentResourceOwner = ResourceOwnerCreate(CurrentResourceOwner, "logical decoding");
@@ -270,8 +272,14 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		/* invalidate non-timetravel entries */
 		InvalidateSystemCaches();
 
+		if (!RecoveryInProgress())
+			end_of_wal = GetFlushRecPtr();
+		else
+			end_of_wal = GetXLogReplayRecPtr(NULL);
+
+		/* Decode until we run out of records */
 		while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) ||
-			 (ctx->reader->EndRecPtr && ctx->reader->EndRecPtr < end_of_wal))
+			 (ctx->reader->EndRecPtr != InvalidXLogRecPtr && ctx->reader->EndRecPtr < end_of_wal))
 		{
 			XLogRecord *record;
 			char	   *errm = NULL;
@@ -280,6 +288,10 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 			if (errm)
 				elog(ERROR, "%s", errm);
 
+			/*
+			 * Now that we've set up the xlog reader state, subsequent calls
+			 * pass InvalidXLogRecPtr to say "continue from last record"
+			 */
 			startptr = InvalidXLogRecPtr;
 
 			/*
@@ -299,6 +311,18 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 			CHECK_FOR_INTERRUPTS();
 		}
 
+		/* Make sure timeline lookups use the start of the next record */
+		startptr = ctx->reader->EndRecPtr;
+
+		/*
+		 * The XLogReader will read a page past the valid end of WAL
+		 * because it doesn't know about timelines. When we switch
+		 * timelines and ask it for the first page on the new timeline it
+		 * will think it has it cached, but it'll have the old partial
+		 * page and say it can't find the next record. So flush the cache.
+		 */
+		XLogReaderInvalCache(ctx->reader);
+
 		tuplestore_donestoring(tupstore);
 
 		CurrentResourceOwner = old_resowner;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 7553cc4..4ccee95 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -20,12 +20,16 @@
  *		with the XLogRec* macros and functions. You can also decode a
  *		record that's already constructed in memory, without reading from
  *		disk, by calling the DecodeXLogRecord() function.
+ *
+ * 		The xlogreader is compiled as both front-end and backend code so
+ * 		it may not use elog, server-defined static variables, etc.
  *-------------------------------------------------------------------------
  */
 #ifndef XLOGREADER_H
 #define XLOGREADER_H
 
 #include "access/xlogrecord.h"
+#include "nodes/pg_list.h"
 
 typedef struct XLogReaderState XLogReaderState;
 
@@ -139,26 +143,46 @@ struct XLogReaderState
 	 * ----------------------------------------
 	 */
 
-	/* Buffer for currently read page (XLOG_BLCKSZ bytes) */
+	/*
+	 * Buffer for currently read page (XLOG_BLCKSZ bytes, valid up to
+	 * at least readLen bytes)
+	 */
 	char	   *readBuf;
 
-	/* last read segment, segment offset, read length, TLI */
+	/*
+	 * last read segment, segment offset, read length, TLI for
+	 * data currently in readBuf.
+	 */
 	XLogSegNo	readSegNo;
 	uint32		readOff;
 	uint32		readLen;
 	TimeLineID	readPageTLI;
 
-	/* beginning of last page read, and its TLI  */
+	/*
+	 * beginning of prior page read, and its TLI. Doesn't
+	 * necessarily correspond to what's in readBuf, used for
+	 * timeline sanity checks.
+	 */
 	XLogRecPtr	latestPagePtr;
 	TimeLineID	latestPageTLI;
 
 	/* beginning of the WAL record being read. */
 	XLogRecPtr	currRecPtr;
+	/* timeline to read it from, 0 if a lookup is required */
+	TimeLineID  currTLI;
+	/*
+	 * Endpoint of timeline in currTLI if it's historical or
+	 * InvalidXLogRecPtr if currTLI is the current timeline.
+	 */
+	XLogRecPtr	currTLIValidUntil;
 
 	/* Buffer for current ReadRecord result (expandable) */
 	char	   *readRecordBuf;
 	uint32		readRecordBufSize;
 
+	/* cached timeline history */
+	List	   *timelineHistory;
+
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
 };
@@ -174,6 +198,9 @@ extern void XLogReaderFree(XLogReaderState *state);
 extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
 			   XLogRecPtr recptr, char **errormsg);
 
+/* Flush any cached page */
+extern void XLogReaderInvalCache(XLogReaderState *state);
+
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif   /* FRONTEND */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 1b9abce..86df8cf 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -50,4 +50,6 @@ extern void FreeFakeRelcacheEntry(Relation fakerel);
 extern int read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI);
 
+extern void XLogReadDetermineTimeline(XLogReaderState *state);
+
 #endif
-- 
2.1.0

0002-Allow-replication-slots-to-follow-failover.patchtext/x-patch; charset=US-ASCII; name=0002-Allow-replication-slots-to-follow-failover.patchDownload
From 45d3d96ab236f1621c79ce05556acf7aace14a7b Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 23 Feb 2016 15:59:37 +0800
Subject: [PATCH 2/8] Allow replication slots to follow failover

Originally replication slots were unique to a single node and weren't
recorded in WAL or replicated. A logical decoding client couldn't follow
a physical standby failover and promotion because the promoted replica
didn't have the original master's slots. The replica may not have
retained all required WAL and there was no way to create a new logical
slot and rewind it back to the point the logical client had replayed to.

Failover slots lift this limitation by replicating slots consistently to
physical standbys, keeping them up to date and using them in WAL
retention calculations. This allows a logical decoding client to follow
a physical failover and promotion without losing its place in the change
stream.

A failover slot may only be created on a master server, as it must be
able to write WAL. This limitation may be lifted later.

pg_basebackup is also modified to copy the contents of pg_replslot.
Non-failover slots will now be removed during backend startup instead
of being omitted from the copy.

This patch does not add any user interface for failover slots. There's
no way to create them from SQL or from the walsender. That and the
documentation for failover slots are in the next patch in the series
so that this patch is entirely focused on the implementation.

Craig Ringer, based on a prototype by Simon Riggs
---
 src/backend/access/rmgrdesc/Makefile       |   2 +-
 src/backend/access/rmgrdesc/replslotdesc.c |  65 ++++
 src/backend/access/transam/rmgr.c          |   1 +
 src/backend/access/transam/xlog.c          |   5 +-
 src/backend/commands/dbcommands.c          |   3 +
 src/backend/replication/basebackup.c       |  12 -
 src/backend/replication/logical/decode.c   |   1 +
 src/backend/replication/logical/logical.c  |  25 +-
 src/backend/replication/slot.c             | 586 +++++++++++++++++++++++++++--
 src/backend/replication/slotfuncs.c        |   4 +-
 src/backend/replication/walsender.c        |   8 +-
 src/bin/pg_xlogdump/replslotdesc.c         |   1 +
 src/bin/pg_xlogdump/rmgrdesc.c             |   1 +
 src/include/access/rmgrlist.h              |   1 +
 src/include/replication/slot.h             |  69 +---
 src/include/replication/slot_xlog.h        | 100 +++++
 16 files changed, 755 insertions(+), 129 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/replslotdesc.c
 create mode 120000 src/bin/pg_xlogdump/replslotdesc.c
 create mode 100644 src/include/replication/slot_xlog.h

diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..600b544 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -10,7 +10,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
 	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
-	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
+	   replorigindesc.o replslotdesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/replslotdesc.c b/src/backend/access/rmgrdesc/replslotdesc.c
new file mode 100644
index 0000000..5829e8d
--- /dev/null
+++ b/src/backend/access/rmgrdesc/replslotdesc.c
@@ -0,0 +1,65 @@
+/*-------------------------------------------------------------------------
+ *
+ * replslotdesc.c
+ *	  rmgr descriptor routines for replication/slot.c
+ *
+ * Portions Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/replslotdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "replication/slot_xlog.h"
+
+void
+replslot_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			{
+				ReplicationSlotInWAL xlrec;
+
+				xlrec = (ReplicationSlotInWAL) rec;
+
+				appendStringInfo(buf, "of slot %s with restart %X/%X and xid %u confirmed to %X/%X",
+						NameStr(xlrec->name),
+						(uint32)(xlrec->restart_lsn>>32), (uint32)(xlrec->restart_lsn),
+						xlrec->xmin,
+						(uint32)(xlrec->confirmed_flush>>32), (uint32)(xlrec->confirmed_flush));
+
+				break;
+			}
+		case XLOG_REPLSLOT_DROP:
+			{
+				xl_replslot_drop *xlrec;
+
+				xlrec = (xl_replslot_drop *) rec;
+
+				appendStringInfo(buf, "of slot %s", NameStr(xlrec->name));
+
+				break;
+			}
+	}
+}
+
+const char *
+replslot_identify(uint8 info)
+{
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			return "UPDATE";
+		case XLOG_REPLSLOT_DROP:
+			return "DROP";
+		default:
+			return NULL;
+	}
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7c4d773..0bd5796 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -24,6 +24,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 94b79ac..a92f09d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6366,8 +6366,11 @@ StartupXLOG(void)
 	/*
 	 * Initialize replication slots, before there's a chance to remove
 	 * required resources.
+	 *
+	 * If we're in archive recovery then non-failover slots are no
+	 * longer of any use and should be dropped during startup.
 	 */
-	StartupReplicationSlots();
+	StartupReplicationSlots(ArchiveRecoveryRequested);
 
 	/*
 	 * Startup logical state, needs to be setup now so we have proper data
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c1c0223..61fc45b 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,9 @@ dbase_redo(XLogReaderState *record)
 		/* Clean out the xlog relcache too */
 		XLogDropDatabase(xlrec->db_id);
 
+		/* Drop any logical failover slots for this database */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* And remove the physical files */
 		if (!rmtree(dst_path, true))
 			ereport(WARNING,
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index af0fb09..ab1f271 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -973,18 +973,6 @@ sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces,
 		}
 
 		/*
-		 * Skip pg_replslot, not useful to copy. But include it as an empty
-		 * directory anyway, so we get permissions right.
-		 */
-		if (strcmp(de->d_name, "pg_replslot") == 0)
-		{
-			if (!sizeonly)
-				_tarWriteHeader(pathbuf + basepathlen + 1, NULL, &statbuf);
-			size += 512;		/* Size of the header just added */
-			continue;
-		}
-
-		/*
 		 * We can skip pg_xlog, the WAL segments need to be fetched from the
 		 * WAL archive anyway. But include it as an empty directory anyway, so
 		 * we get permissions right.
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 88c3a49..76fc5c7 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -135,6 +135,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
+		case RM_REPLSLOT_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2e6d3f9..4feb2ca 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -85,16 +85,19 @@ CheckLogicalDecodingRequirements(void)
 				 errmsg("logical decoding requires a database connection")));
 
 	/* ----
-	 * TODO: We got to change that someday soon...
+	 * TODO: Allow logical decoding from a standby
 	 *
-	 * There's basically three things missing to allow this:
+	 * There are some things missing to allow this:
 	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
+	 *    LSN belongs to
+	 * 2) To prevent needed rows from being removed we would need
+	 *    to enhance hot_standby_feedback so it sends both xmin and
+	 *    catalog_xmin to the master.  A standby slot can't write WAL, so we
+	 *    wouldn't be able to use it directly for failover, without some very
+	 *    complex state interactions via master.
+	 *
+	 * So this doesn't seem likely to change anytime soon.
+	 *
 	 * ----
 	 */
 	if (RecoveryInProgress())
@@ -272,7 +275,7 @@ CreateInitDecodingContext(char *plugin,
 	slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
 	slot->data.catalog_xmin = slot->effective_catalog_xmin;
 
-	ReplicationSlotsComputeRequiredXmin(true);
+	ReplicationSlotsUpdateRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
 
@@ -908,8 +911,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 			MyReplicationSlot->effective_catalog_xmin = MyReplicationSlot->data.catalog_xmin;
 			SpinLockRelease(&MyReplicationSlot->mutex);
 
-			ReplicationSlotsComputeRequiredXmin(false);
-			ReplicationSlotsComputeRequiredLSN();
+			ReplicationSlotsUpdateRequiredXmin(false);
+			ReplicationSlotsUpdateRequiredLSN();
 		}
 	}
 	else
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index affa9b9..915c0af 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -24,7 +24,18 @@
  * directory. Inside that directory the state file will contain the slot's
  * own data. Additional data can be stored alongside that file if required.
  * While the server is running, the state data is also cached in memory for
- * efficiency.
+ * efficiency. Non-failover slots are NOT subject to WAL logging and may
+ * be used on standbys (though that's only supported for physical slots at
+ * the moment). They use tempfile writes and swaps for crash safety.
+ *
+ * A failover slot created on a master node generates WAL records that
+ * maintain a copy of the slot on standby nodes. If a standby node is
+ * promoted the failover slot allows access to be restarted just as if
+ * the original master node was being accessed, allowing for the timeline
+ * change. The replica considers slot positions when removing WAL to make
+ * sure it can satisfy the needs of slots after promotion.  For logical
+ * decoding slots the slot's internal state is kept up to date so it's
+ * ready for use after promotion.
  *
  * ReplicationSlotAllocationLock must be taken in exclusive mode to allocate
  * or free a slot. ReplicationSlotControlLock must be taken in shared mode
@@ -44,6 +55,7 @@
 #include "common/string.h"
 #include "miscadmin.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -101,10 +113,14 @@ static LWLockTranche ReplSlotIOLWLockTranche;
 static void ReplicationSlotDropAcquired(void);
 
 /* internal persistency functions */
-static void RestoreSlotFromDisk(const char *name);
+static void RestoreSlotFromDisk(const char *name, bool drop_nonfailover_slots);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
 static void SaveSlotToPath(ReplicationSlot *slot, const char *path, int elevel);
 
+/* internal redo functions */
+static void ReplicationSlotRedoCreateOrUpdate(ReplicationSlotInWAL xlrec);
+static void ReplicationSlotRedoDrop(const char * slotname);
+
 /*
  * Report shared-memory space needed by ReplicationSlotShmemInit.
  */
@@ -220,7 +236,8 @@ ReplicationSlotValidateName(const char *name, int elevel)
  */
 void
 ReplicationSlotCreate(const char *name, bool db_specific,
-					  ReplicationSlotPersistency persistency)
+					  ReplicationSlotPersistency persistency,
+					  bool failover)
 {
 	ReplicationSlot *slot = NULL;
 	int			i;
@@ -278,6 +295,15 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	StrNCpy(NameStr(slot->data.name), name, NAMEDATALEN);
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.restart_lsn = InvalidXLogRecPtr;
+	/* Slot timeline is unused and always zero */
+	slot->data.restart_tli = 0;
+
+	if (failover && RecoveryInProgress())
+		ereport(ERROR,
+				(errmsg("a failover slot may not be created on a replica"),
				 errhint("Create the slot on the master server instead.")));
+
+	slot->data.failover = failover;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -313,6 +339,10 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 
 /*
  * Find a previously created slot and mark it as used by this backend.
+ *
+ * Sets active_pid and assigns MyReplicationSlot iff successfully acquired.
+ *
+ * ERRORs on an attempt to acquire a failover slot when in recovery.
  */
 void
 ReplicationSlotAcquire(const char *name)
@@ -335,7 +365,11 @@ ReplicationSlotAcquire(const char *name)
 		{
 			SpinLockAcquire(&s->mutex);
 			active_pid = s->active_pid;
-			if (active_pid == 0)
+			/*
+			 * We can only claim a slot for our use if it's not claimed
+			 * by someone else AND it isn't a failover slot on a standby.
+			 */
+			if (active_pid == 0 && !(RecoveryInProgress() && s->data.failover))
 				s->active_pid = MyProcPid;
 			SpinLockRelease(&s->mutex);
 			slot = s;
@@ -349,12 +383,24 @@ ReplicationSlotAcquire(const char *name)
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_OBJECT),
 				 errmsg("replication slot \"%s\" does not exist", name)));
+
 	if (active_pid != 0)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_IN_USE),
 			   errmsg("replication slot \"%s\" is active for PID %d",
 					  name, active_pid)));
 
+	/*
+	 * An attempt to use a failover slot from a standby must fail since
+	 * we can't write WAL from a standby and there's no sensible way
+	 * to advance slot position from both replica and master anyway.
+	 */
+	if (RecoveryInProgress() && slot->data.failover)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_IN_USE),
+				 errmsg("replication slot \"%s\" is reserved for use after failover",
+					  name)));
+
 	/* We made this slot active, so it's ours now. */
 	MyReplicationSlot = slot;
 }
@@ -411,6 +457,9 @@ ReplicationSlotDrop(const char *name)
 /*
  * Permanently drop the currently acquired replication slot which will be
  * released by the point this function returns.
+ *
+ * Callers must NOT hold ReplicationSlotControlLock in SHARED mode.  EXCLUSIVE
+ * is OK, or not held at all.
  */
 static void
 ReplicationSlotDropAcquired(void)
@@ -418,9 +467,14 @@ ReplicationSlotDropAcquired(void)
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
 	ReplicationSlot *slot = MyReplicationSlot;
+	bool slot_is_failover;
+	bool took_control_lock = false,
+		 took_allocation_lock = false;
 
 	Assert(MyReplicationSlot != NULL);
 
+	slot_is_failover = slot->data.failover;
+
 	/* slot isn't acquired anymore */
 	MyReplicationSlot = NULL;
 
@@ -428,8 +482,27 @@ ReplicationSlotDropAcquired(void)
 	 * If some other backend ran this code concurrently with us, we might try
 	 * to delete a slot with a certain name while someone else was trying to
 	 * create a slot with the same name.
+	 *
+	 * If called with the lock already held it MUST be held in
+	 * EXCLUSIVE mode.
 	 */
-	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+	if (!LWLockHeldByMe(ReplicationSlotAllocationLock))
+	{
+		took_allocation_lock = true;
+		LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+	}
+
+	/* Record the drop in XLOG if we aren't replaying WAL */
+	if (XLogInsertAllowed() && slot_is_failover)
+	{
+		xl_replslot_drop xlrec;
+
+		memcpy(&(xlrec.name), NameStr(slot->data.name), NAMEDATALEN);
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xlrec));
+		(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_DROP);
+	}
 
 	/* Generate pathnames. */
 	sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
@@ -459,7 +532,11 @@ ReplicationSlotDropAcquired(void)
 	}
 	else
 	{
-		bool		fail_softly = slot->data.persistency == RS_EPHEMERAL;
+		bool		fail_softly = false;
+
+		if (RecoveryInProgress() ||
+			slot->data.persistency == RS_EPHEMERAL)
+			fail_softly = true;
 
 		SpinLockAcquire(&slot->mutex);
 		slot->active_pid = 0;
@@ -477,18 +554,27 @@ ReplicationSlotDropAcquired(void)
 	 * grabbing the mutex because nobody else can be scanning the array here,
 	 * and nobody can be attached to this slot and thus access it without
 	 * scanning the array.
+	 *
+	 * You must hold the lock in EXCLUSIVE mode or not at all.
 	 */
-	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	if (!LWLockHeldByMe(ReplicationSlotControlLock))
+	{
+		took_control_lock = true;
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	}
+
 	slot->active_pid = 0;
 	slot->in_use = false;
-	LWLockRelease(ReplicationSlotControlLock);
+
+	if (took_control_lock)
+		LWLockRelease(ReplicationSlotControlLock);
 
 	/*
 	 * Slot is dead and doesn't prevent resource removal anymore, recompute
 	 * limits.
 	 */
-	ReplicationSlotsComputeRequiredXmin(false);
-	ReplicationSlotsComputeRequiredLSN();
+	ReplicationSlotsUpdateRequiredXmin(false);
+	ReplicationSlotsUpdateRequiredLSN();
 
 	/*
 	 * If removing the directory fails, the worst thing that will happen is
@@ -504,7 +590,8 @@ ReplicationSlotDropAcquired(void)
 	 * We release this at the very end, so that nobody starts trying to create
 	 * a slot while we're still cleaning up the detritus of the old one.
 	 */
-	LWLockRelease(ReplicationSlotAllocationLock);
+	if (took_allocation_lock)
+		LWLockRelease(ReplicationSlotAllocationLock);
 }
 
 /*
@@ -544,6 +631,9 @@ ReplicationSlotMarkDirty(void)
 /*
  * Convert a slot that's marked as RS_EPHEMERAL to a RS_PERSISTENT slot,
  * guaranteeing it will be there after an eventual crash.
+ *
+ * Failover slots will emit a create xlog record at this time, having
+ * not been previously written to xlog.
  */
 void
 ReplicationSlotPersist(void)
@@ -565,7 +655,7 @@ ReplicationSlotPersist(void)
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  */
 void
-ReplicationSlotsComputeRequiredXmin(bool already_locked)
+ReplicationSlotsUpdateRequiredXmin(bool already_locked)
 {
 	int			i;
 	TransactionId agg_xmin = InvalidTransactionId;
@@ -610,10 +700,20 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 }
 
 /*
- * Compute the oldest restart LSN across all slots and inform xlog module.
+ * Update the xlog module's copy of the minimum restart lsn across all slots
  */
 void
-ReplicationSlotsComputeRequiredLSN(void)
+ReplicationSlotsUpdateRequiredLSN(void)
+{
+	XLogSetReplicationSlotMinimumLSN(ReplicationSlotsComputeRequiredLSN(false));
+}
+
+/*
+ * Compute the oldest restart LSN across all slots (or optionally
+ * only failover slots) and return it.
+ */
+XLogRecPtr
+ReplicationSlotsComputeRequiredLSN(bool failover_only)
 {
 	int			i;
 	XLogRecPtr	min_required = InvalidXLogRecPtr;
@@ -625,14 +725,19 @@ ReplicationSlotsComputeRequiredLSN(void)
 	{
 		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
 		XLogRecPtr	restart_lsn;
+		bool		failover;
 
 		if (!s->in_use)
 			continue;
 
 		SpinLockAcquire(&s->mutex);
 		restart_lsn = s->data.restart_lsn;
+		failover = s->data.failover;
 		SpinLockRelease(&s->mutex);
 
+		if (failover_only && !failover)
+			continue;
+
 		if (restart_lsn != InvalidXLogRecPtr &&
 			(min_required == InvalidXLogRecPtr ||
 			 restart_lsn < min_required))
@@ -640,7 +745,7 @@ ReplicationSlotsComputeRequiredLSN(void)
 	}
 	LWLockRelease(ReplicationSlotControlLock);
 
-	XLogSetReplicationSlotMinimumLSN(min_required);
+	return min_required;
 }
 
 /*
@@ -649,7 +754,7 @@ ReplicationSlotsComputeRequiredLSN(void)
  * Returns InvalidXLogRecPtr if logical decoding is disabled or no logical
  * slots exist.
  *
- * NB: this returns a value >= ReplicationSlotsComputeRequiredLSN(), since it
+ * NB: this returns a value >= ReplicationSlotsComputeRequiredLSN(false), since it
  * ignores physical replication slots.
  *
  * The results aren't required frequently, so we don't maintain a precomputed
@@ -747,6 +852,45 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	Assert(MyReplicationSlot == NULL);
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		if (s->data.database == dboid)
+		{
+			/*
+			 * There should be no connections to this dbid
+			 * therefore all slots for this dbid should be
+			 * logical, inactive failover slots.
+			 */
+			Assert(s->active_pid == 0);
+			Assert(s->in_use);
+			Assert(SlotIsLogical(s));
+
+			/*
+			 * Acquire the replication slot
+			 */
+			MyReplicationSlot = s;
+
+			/*
+			 * No need to deactivate slot, especially since we
+			 * already hold ReplicationSlotControlLock.
+			 */
+			ReplicationSlotDropAcquired();
+		}
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	MyReplicationSlot = NULL;
+}
 
 /*
  * Check whether the server's configuration supports using replication
@@ -779,12 +923,13 @@ ReplicationSlotReserveWal(void)
 
 	Assert(slot != NULL);
 	Assert(slot->data.restart_lsn == InvalidXLogRecPtr);
+	Assert(slot->data.restart_tli == 0);
 
 	/*
 	 * The replication slot mechanism is used to prevent removal of required
 	 * WAL. As there is no interlock between this routine and checkpoints, WAL
 	 * segments could concurrently be removed when a now stale return value of
-	 * ReplicationSlotsComputeRequiredLSN() is used. In the unlikely case that
+	 * ReplicationSlotsUpdateRequiredLSN() is used. In the unlikely case that
 	 * this happens we'll just retry.
 	 */
 	while (true)
@@ -821,12 +966,12 @@ ReplicationSlotReserveWal(void)
 		}
 
 		/* prevent WAL removal as fast as possible */
-		ReplicationSlotsComputeRequiredLSN();
+		ReplicationSlotsUpdateRequiredLSN();
 
 		/*
 		 * If all required WAL is still there, great, otherwise retry. The
 		 * slot should prevent further removal of WAL, unless there's a
-		 * concurrent ReplicationSlotsComputeRequiredLSN() after we've written
+		 * concurrent ReplicationSlotsUpdateRequiredLSN() after we've written
 		 * the new restart_lsn above, so normally we should never need to loop
 		 * more than twice.
 		 */
@@ -878,7 +1023,7 @@ CheckPointReplicationSlots(void)
  * needs to be run before we start crash recovery.
  */
 void
-StartupReplicationSlots(void)
+StartupReplicationSlots(bool drop_nonfailover_slots)
 {
 	DIR		   *replication_dir;
 	struct dirent *replication_de;
@@ -917,7 +1062,7 @@ StartupReplicationSlots(void)
 		}
 
 		/* looks like a slot in a normal state, restore */
-		RestoreSlotFromDisk(replication_de->d_name);
+		RestoreSlotFromDisk(replication_de->d_name, drop_nonfailover_slots);
 	}
 	FreeDir(replication_dir);
 
@@ -926,8 +1071,8 @@ StartupReplicationSlots(void)
 		return;
 
 	/* Now that we have recovered all the data, compute replication xmin */
-	ReplicationSlotsComputeRequiredXmin(false);
-	ReplicationSlotsComputeRequiredLSN();
+	ReplicationSlotsUpdateRequiredXmin(false);
+	ReplicationSlotsUpdateRequiredLSN();
 }
 
 /* ----
@@ -996,6 +1141,8 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 
 /*
  * Shared functionality between saving and creating a replication slot.
+ *
+ * For failover slots this is where we emit xlog.
  */
 static void
 SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
@@ -1006,15 +1153,18 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	ReplicationSlotOnDisk cp;
 	bool		was_dirty;
 
-	/* first check whether there's something to write out */
-	SpinLockAcquire(&slot->mutex);
-	was_dirty = slot->dirty;
-	slot->just_dirtied = false;
-	SpinLockRelease(&slot->mutex);
+	if (!RecoveryInProgress())
+	{
+		/* first check whether there's something to write out */
+		SpinLockAcquire(&slot->mutex);
+		was_dirty = slot->dirty;
+		slot->just_dirtied = false;
+		SpinLockRelease(&slot->mutex);
 
-	/* and don't do anything if there's nothing to write */
-	if (!was_dirty)
-		return;
+		/* and don't do anything if there's nothing to write */
+		if (!was_dirty)
+			return;
+	}
 
 	LWLockAcquire(&slot->io_in_progress_lock, LW_EXCLUSIVE);
 
@@ -1047,6 +1197,25 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	SpinLockRelease(&slot->mutex);
 
+	/*
+	 * If needed, record this action in WAL
+	 */
+	if (slot->data.failover &&
+		slot->data.persistency == RS_PERSISTENT &&
+		!RecoveryInProgress())
+	{
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&cp.slotdata), sizeof(ReplicationSlotPersistentData));
+		/*
+		 * Note that slot creation on the downstream is also an "update".
+		 *
+		 * Slots can start off ephemeral and be updated to persistent. We just
+		 * log the update and the downstream creates the new slot if it doesn't
+		 * exist yet.
+		 */
+		(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_UPDATE);
+	}
+
 	COMP_CRC32C(cp.checksum,
 				(char *) (&cp) + SnapBuildOnDiskNotChecksummedSize,
 				SnapBuildOnDiskChecksummedSize);
@@ -1116,7 +1285,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
  * Load a single slot from disk into memory.
  */
 static void
-RestoreSlotFromDisk(const char *name)
+RestoreSlotFromDisk(const char *name, bool drop_nonfailover_slots)
 {
 	ReplicationSlotOnDisk cp;
 	int			i;
@@ -1235,10 +1404,21 @@ RestoreSlotFromDisk(const char *name)
 						path, checksum, cp.checksum)));
 
 	/*
-	 * If we crashed with an ephemeral slot active, don't restore but delete
-	 * it.
+	 * If we crashed with an ephemeral slot active, don't restore but
+	 * delete it.
+	 *
+	 * Similarly, if we're in archive recovery and will be running as
+	 * a standby (when drop_nonfailover_slots is set), non-failover
+	 * slots can't be relied upon. Logical slots might have a catalog
+	 * xmin lower than reality because the original slot on the master
+	 * advanced past the point the stale slot on the replica is stuck
+	 * at. Additionally slots might have been copied while being
+	 * written to if the basebackup copy method was not atomic.
+	 * Failover slots are safe since they're WAL-logged and follow the
+	 * master's slot position.
 	 */
-	if (cp.slotdata.persistency != RS_PERSISTENT)
+	if (cp.slotdata.persistency != RS_PERSISTENT
+			|| (drop_nonfailover_slots && !cp.slotdata.failover))
 	{
 		sprintf(path, "pg_replslot/%s", name);
 
@@ -1249,6 +1429,14 @@ RestoreSlotFromDisk(const char *name)
 					 errmsg("could not remove directory \"%s\"", path)));
 		}
 		fsync_fname("pg_replslot", true);
+
+		if (cp.slotdata.persistency == RS_PERSISTENT)
+		{
+			ereport(LOG,
+					(errmsg("dropped non-failover slot \"%s\" during archive recovery",
+							 NameStr(cp.slotdata.name))));
+		}
+
 		return;
 	}
 
@@ -1285,5 +1473,331 @@ RestoreSlotFromDisk(const char *name)
 	if (!restored)
 		ereport(PANIC,
 				(errmsg("too many replication slots active before shutdown"),
-				 errhint("Increase max_replication_slots and try again.")));
+				 errhint("Increase max_replication_slots (currently %d) and try again.",
+					 max_replication_slots)));
+}
+
+/*
+ * This usually just writes new persistent data to the slot state, but an
+ * update record might create a new slot on the downstream if we changed a
+ * previously ephemeral slot to persistent. We have to decide which
+ * by looking for the existing slot.
+ */
+static void
+ReplicationSlotRedoCreateOrUpdate(ReplicationSlotInWAL xlrec)
+{
+	ReplicationSlot *slot;
+	bool	found_available = false;
+	bool	found_duplicate = false;
+	int		use_slotid = 0;
+	int		i;
+
+	/*
+	 * We're in redo, but someone could still create a local
+	 * non-failover slot and race with us unless we take the
+	 * allocation lock.
+	 */
+	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Find first unused position in the slots array, but keep on
+		 * scanning in case there's an existing slot with the same
+		 * name.
+		 */
+		if (!slot->in_use && !found_available)
+		{
+			use_slotid = i;
+			found_available = true;
+		}
+
+		/*
+		 * Existing slot with same name? It could be our failover slot
+		 * to update or a non-failover slot with a conflicting name.
+		 */
+		if (strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) == 0)
+		{
+			use_slotid = i;
+			found_available = true;
+			found_duplicate = true;
+			break;
+		}
+	}
+
+	if (found_duplicate && !slot->data.failover)
+	{
+		/*
+		 * A local non-failover slot exists with the same name as
+		 * the failover slot we're creating.
+		 *
+		 * Clobber the client, drop its slot, and carry on with
+		 * our business.
+		 *
+		 * First we must temporarily release the allocation lock while
+		 * we try to terminate the process that holds the slot, since
+		 * we don't want to hold the LWlock for ages. We'll reacquire
+		 * it later.
+		 */
+		LWLockRelease(ReplicationSlotAllocationLock);
+
+		/* We might race with other clients, so retry-loop */
+		do
+		{
+			int active_pid = slot->active_pid;
+			int max_sleep_millis = 120 * 1000;
+			int millis_per_sleep = 1000;
+
+			if (active_pid != 0)
+			{
+				ereport(INFO,
+						(errmsg("terminating active connection by PID %d to local slot \"%s\" because of conflict with recovery",
+							active_pid, NameStr(slot->data.name))));
+
+				if (kill(active_pid, SIGTERM))
+					elog(DEBUG1, "failed to signal pid %u to terminate on slot conflict: %m",
+							active_pid);
+
+				/*
+				 * No way to wait for the process since it's not a child
+				 * of ours and there's no latch to set, so poll.
+				 *
+				 * We're checking this without any locks held, but
+				 * we'll recheck when we attempt to drop the slot.
+				 */
+				while (slot->in_use && slot->active_pid == active_pid
+						&& max_sleep_millis > 0)
+				{
+					int rc;
+
+					rc = WaitLatch(MyLatch,
+							WL_TIMEOUT | WL_LATCH_SET | WL_POSTMASTER_DEATH,
+							millis_per_sleep);
+
+					if (rc & WL_POSTMASTER_DEATH)
+						elog(FATAL, "exiting after postmaster termination");
+
+					/*
+					 * Might be shorter if something sets our latch, but
+					 * we don't care much.
+					 */
+					max_sleep_millis -= millis_per_sleep;
+				}
+
+				if (max_sleep_millis <= 0)
+					elog(WARNING, "process %u is taking too long to terminate after SIGTERM",
+							slot->active_pid);
+			}
+
+			if (slot->active_pid == 0)
+			{
+				/* Try to acquire and drop the slot */
+				SpinLockAcquire(&slot->mutex);
+
+				if (slot->active_pid != 0)
+				{
+					/* Lost the race, go around */
+				}
+				else
+				{
+					/* Claim the slot for ourselves */
+					slot->active_pid = MyProcPid;
+					MyReplicationSlot = slot;
+				}
+				SpinLockRelease(&slot->mutex);
+			}
+
+			if (slot->active_pid == MyProcPid)
+			{
+				NameData slotname;
+				strncpy(NameStr(slotname), NameStr(slot->data.name), NAMEDATALEN);
+				(NameStr(slotname))[NAMEDATALEN-1] = '\0';
+
+				/*
+				 * Reclaim the allocation lock and THEN drop the slot,
+				 * so nobody else can grab the name until we've
+				 * finished redo.
+				 */
+				LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+				ReplicationSlotDropAcquired();
+				/* We clobbered the duplicate, treat it as new */
+				found_duplicate = false;
+
+				ereport(WARNING,
+						(errmsg("dropped local replication slot \"%s\" because of conflict with recovery",
+								NameStr(slotname)),
+						 errdetail("A failover slot with the same name was created on the master server")));
+			}
+		}
+		while (slot->in_use);
+	}
+
+	Assert(LWLockHeldByMe(ReplicationSlotAllocationLock));
+
+	/*
+	 * This is either an empty slot control position to make a new slot or it's
+	 * an existing entry for this failover slot that we need to update.
+	 */
+	if (found_available)
+	{
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+
+		slot = &ReplicationSlotCtl->replication_slots[use_slotid];
+
+		/* restore the entire set of persistent data */
+		memcpy(&slot->data, xlrec,
+			   sizeof(ReplicationSlotPersistentData));
+
+		Assert(strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) == 0);
+		Assert(slot->data.failover && slot->data.persistency == RS_PERSISTENT);
+
+		/* Update the non-persistent in-memory state */
+		slot->effective_xmin = xlrec->xmin;
+		slot->effective_catalog_xmin = xlrec->catalog_xmin;
+
+		if (found_duplicate)
+		{
+			char		path[MAXPGPATH];
+
+			/* Write an existing slot to disk */
+			Assert(slot->in_use);
+			Assert(slot->active_pid == 0); /* can't be replaying from failover slot */
+
+			sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
+			slot->dirty = true;
+			SaveSlotToPath(slot, path, ERROR);
+		}
+		else
+		{
+			Assert(!slot->in_use);
+			/* In-memory state that's only set on create, not update */
+			slot->active_pid = 0;
+			slot->in_use = true;
+			slot->candidate_catalog_xmin = InvalidTransactionId;
+			slot->candidate_xmin_lsn = InvalidXLogRecPtr;
+			slot->candidate_restart_lsn = InvalidXLogRecPtr;
+			slot->candidate_restart_valid = InvalidXLogRecPtr;
+
+			CreateSlotOnDisk(slot);
+		}
+
+		LWLockRelease(ReplicationSlotControlLock);
+
+		ReplicationSlotsUpdateRequiredXmin(false);
+		ReplicationSlotsUpdateRequiredLSN();
+	}
+
+	LWLockRelease(ReplicationSlotAllocationLock);
+
+	if (!found_available)
+	{
+		/*
+		 * Because the standby should have the same or greater max_replication_slots
+		 * as the master this shouldn't happen, but just in case...
+		 */
+		ereport(ERROR,
+				(errmsg("max_replication_slots exceeded, cannot replay failover slot creation"),
+				 errhint("Increase max_replication_slots and try again.")));
+	}
+}
+
+/*
+ * Redo a slot drop of a failover slot. This might be a redo during crash
+ * recovery on the master or it may be replay on a standby.
+ */
+static void
+ReplicationSlotRedoDrop(const char * slotname)
+{
+	/*
+	 * Acquire the failover slot that's to be dropped.
+	 *
+	 * We can't ReplicationSlotAcquire here because we want to acquire
+	 * a replication slot during replay, which isn't usually allowed.
+	 * Also, because we might crash midway through a drop we can't
+	 * assume we'll actually find the slot so it's not an error for
+	 * the slot to be missing.
+	 */
+	int			i;
+
+	Assert(MyReplicationSlot == NULL);
+
+	ReplicationSlotValidateName(slotname, ERROR);
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		if (s->in_use && strcmp(slotname, NameStr(s->data.name)) == 0)
+		{
+			if (s->data.persistency != RS_PERSISTENT)
+			{
+				/* shouldn't happen */
+				elog(WARNING, "found conflicting non-persistent slot during failover slot drop");
+				break;
+			}
+
+			if (!s->data.failover)
+			{
+				/* shouldn't happen */
+				elog(WARNING, "found non-failover slot during redo of slot drop");
+				break;
+			}
+
+			/* A failover slot can't be active during recovery */
+			Assert(s->active_pid == 0);
+
+			/* Claim the slot */
+			s->active_pid = MyProcPid;
+			MyReplicationSlot = s;
+
+			break;
+		}
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	if (MyReplicationSlot != NULL)
+	{
+		ReplicationSlotDropAcquired();
+	}
+	else
+	{
+		elog(WARNING, "failover slot %s not found during redo of drop",
+				slotname);
+	}
+}
+
+void
+replslot_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		/*
+		 * Update the values for an existing failover slot or, when a slot
+		 * is first logged as persistent, create it on the downstream.
+		 */
+		case XLOG_REPLSLOT_UPDATE:
+			ReplicationSlotRedoCreateOrUpdate((ReplicationSlotInWAL) XLogRecGetData(record));
+			break;
+
+		/*
+		 * Drop an existing failover slot.
+		 */
+		case XLOG_REPLSLOT_DROP:
+			{
+				xl_replslot_drop *xlrec =
+				(xl_replslot_drop *) XLogRecGetData(record);
+
+				ReplicationSlotRedoDrop(NameStr(xlrec->name));
+
+				break;
+			}
+
+		default:
+			elog(PANIC, "replslot_redo: unknown op code %u", info);
+	}
 }
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 9cc24ea..f430714 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -57,7 +57,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 	CheckSlotRequirements();
 
 	/* acquire replication slot, this will check for conflicting names */
-	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT);
+	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT, false);
 
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	nulls[0] = false;
@@ -120,7 +120,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	 * errors during initialization because it'll get dropped if this
 	 * transaction fails. We'll make it persistent at the end.
 	 */
-	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL);
+	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL, false);
 
 	/*
 	 * Create logical decoding context, to build the initial snapshot.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c03e045..1583862 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -792,7 +792,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	if (cmd->kind == REPLICATION_KIND_PHYSICAL)
 	{
-		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT);
+		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT, false);
 	}
 	else
 	{
@@ -803,7 +803,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * handle errors during initialization because it'll get dropped if
 		 * this transaction fails. We'll make it persistent at the end.
 		 */
-		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL);
+		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL, false);
 	}
 
 	initStringInfo(&output_message);
@@ -1523,7 +1523,7 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	if (changed)
 	{
 		ReplicationSlotMarkDirty();
-		ReplicationSlotsComputeRequiredLSN();
+		ReplicationSlotsUpdateRequiredLSN();
 	}
 
 	/*
@@ -1619,7 +1619,7 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 	if (changed)
 	{
 		ReplicationSlotMarkDirty();
-		ReplicationSlotsComputeRequiredXmin(false);
+		ReplicationSlotsUpdateRequiredXmin(false);
 	}
 }
 
diff --git a/src/bin/pg_xlogdump/replslotdesc.c b/src/bin/pg_xlogdump/replslotdesc.c
new file mode 120000
index 0000000..2e088d2
--- /dev/null
+++ b/src/bin/pg_xlogdump/replslotdesc.c
@@ -0,0 +1 @@
+../../../src/backend/access/rmgrdesc/replslotdesc.c
\ No newline at end of file
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index f9cd395..73ed7d4 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -26,6 +26,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "rmgrdesc.h"
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index fab912d..124b7e5 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -45,3 +45,4 @@ PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_start
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
+PG_RMGR(RM_REPLSLOT_ID, "ReplicationSlot", replslot_redo, replslot_desc, replslot_identify, NULL, NULL)
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8be8ab6..cdcbd37 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -11,69 +11,12 @@
 
 #include "fmgr.h"
 #include "access/xlog.h"
-#include "access/xlogreader.h"
+#include "replication/slot_xlog.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
 
 /*
- * Behaviour of replication slots, upon release or crash.
- *
- * Slots marked as PERSISTENT are crashsafe and will not be dropped when
- * released. Slots marked as EPHEMERAL will be dropped when released or after
- * restarts.
- *
- * EPHEMERAL slots can be made PERSISTENT by calling ReplicationSlotPersist().
- */
-typedef enum ReplicationSlotPersistency
-{
-	RS_PERSISTENT,
-	RS_EPHEMERAL
-} ReplicationSlotPersistency;
-
-/*
- * On-Disk data of a replication slot, preserved across restarts.
- */
-typedef struct ReplicationSlotPersistentData
-{
-	/* The slot's identifier */
-	NameData	name;
-
-	/* database the slot is active on */
-	Oid			database;
-
-	/*
-	 * The slot's behaviour when being dropped (or restored after a crash).
-	 */
-	ReplicationSlotPersistency persistency;
-
-	/*
-	 * xmin horizon for data
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId xmin;
-
-	/*
-	 * xmin horizon for catalog tuples
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId catalog_xmin;
-
-	/* oldest LSN that might be required by this replication slot */
-	XLogRecPtr	restart_lsn;
-
-	/* oldest LSN that the client has acked receipt for */
-	XLogRecPtr	confirmed_flush;
-
-	/* plugin name */
-	NameData	plugin;
-} ReplicationSlotPersistentData;
-
-/*
  * Shared memory state of a single replication slot.
  */
 typedef struct ReplicationSlot
@@ -155,7 +98,7 @@ extern void ReplicationSlotsShmemInit(void);
 
 /* management of individual slots */
 extern void ReplicationSlotCreate(const char *name, bool db_specific,
-					  ReplicationSlotPersistency p);
+					  ReplicationSlotPersistency p, bool failover);
 extern void ReplicationSlotPersist(void);
 extern void ReplicationSlotDrop(const char *name);
 
@@ -167,12 +110,14 @@ extern void ReplicationSlotMarkDirty(void);
 /* misc stuff */
 extern bool ReplicationSlotValidateName(const char *name, int elevel);
 extern void ReplicationSlotReserveWal(void);
-extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
-extern void ReplicationSlotsComputeRequiredLSN(void);
+extern void ReplicationSlotsUpdateRequiredXmin(bool already_locked);
+extern void ReplicationSlotsUpdateRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
+extern XLogRecPtr ReplicationSlotsComputeRequiredLSN(bool failover_only);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
-extern void StartupReplicationSlots(void);
+extern void StartupReplicationSlots(bool drop_nonfailover_slots);
 extern void CheckPointReplicationSlots(void);
 
 extern void CheckSlotRequirements(void);
diff --git a/src/include/replication/slot_xlog.h b/src/include/replication/slot_xlog.h
new file mode 100644
index 0000000..e3211f5
--- /dev/null
+++ b/src/include/replication/slot_xlog.h
@@ -0,0 +1,100 @@
+/*-------------------------------------------------------------------------
+ * slot_xlog.h
+ *	   Replication slot management.
+ *
+ * Copyright (c) 2012-2015, PostgreSQL Global Development Group
+ *
+ * src/include/replication/slot_xlog.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef SLOT_XLOG_H
+#define SLOT_XLOG_H
+
+#include "fmgr.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogreader.h"
+
+/*
+ * Behaviour of replication slots, upon release or crash.
+ *
+ * Slots marked as PERSISTENT are crashsafe and will not be dropped when
+ * released. Slots marked as EPHEMERAL will be dropped when released or after
+ * restarts.
+ *
+ * EPHEMERAL slots can be made PERSISTENT by calling ReplicationSlotPersist().
+ */
+typedef enum ReplicationSlotPersistency
+{
+	RS_PERSISTENT,
+	RS_EPHEMERAL
+} ReplicationSlotPersistency;
+
+/*
+ * On-Disk data of a replication slot, preserved across restarts.
+ */
+typedef struct ReplicationSlotPersistentData
+{
+	/* The slot's identifier */
+	NameData	name;
+
+	/* database the slot is active on */
+	Oid			database;
+
+	/*
+	 * The slot's behaviour when being dropped (or restored after a crash).
+	 */
+	ReplicationSlotPersistency persistency;
+
+	/*
+	 * Slots created on master become failover-slots and are maintained
+	 * on all standbys, but are only assignable after failover.
+	 */
+	bool		failover;
+
+	/*
+	 * xmin horizon for data
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId xmin;
+
+	/*
+	 * xmin horizon for catalog tuples
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId catalog_xmin;
+
+	/* oldest LSN that might be required by this replication slot */
+	XLogRecPtr	restart_lsn;
+	TimeLineID	restart_tli;
+
+	/* oldest LSN that the client has acked receipt for */
+	XLogRecPtr	confirmed_flush;
+
+	/* plugin name */
+	NameData	plugin;
+} ReplicationSlotPersistentData;
+
+typedef ReplicationSlotPersistentData *ReplicationSlotInWAL;
+
+/*
+ * WAL records for failover slots
+ */
+#define XLOG_REPLSLOT_UPDATE	0x10
+#define XLOG_REPLSLOT_DROP		0x20
+
+typedef struct xl_replslot_drop
+{
+	NameData	name;
+} xl_replslot_drop;
+
+/* WAL logging */
+extern void replslot_redo(XLogReaderState *record);
+extern void replslot_desc(StringInfo buf, XLogReaderState *record);
+extern const char *replslot_identify(uint8 info);
+
+#endif   /* SLOT_XLOG_H */
-- 
2.1.0

#20Oleksii Kliukin
alexk@hintbits.com
In reply to: Craig Ringer (#19)
Re: WIP: Failover Slots

On 23 Feb 2016, at 11:30, Craig Ringer <craig@2ndquadrant.com> wrote:

Updated patch

... attached

I've split it up a bit more too, so it's easier to tell what change is for what and fixed the issues mentioned by Oleksii. I've also removed some unrelated documentation changes.

Patch 0001, timeline switches for logical decoding, is unchanged since the last post.

Thank you. I have read the user-interface part now; it looks good to me.

I found the following issue when shutting down a master with a connected replica that uses a physical failover slot:

2016-02-23 20:33:42.546 CET,,,54998,,56ccb3f3.d6d6,3,,2016-02-23 20:33:07 CET,,0,DEBUG,00000,"performing replication slot checkpoint",,,,,,,,,""
2016-02-23 20:33:42.594 CET,,,55002,,56ccb3f3.d6da,4,,2016-02-23 20:33:07 CET,,0,DEBUG,00000,"archived transaction log file ""000000010000000000000003""",,,,,,,,,""
2016-02-23 20:33:42.601 CET,,,54998,,56ccb3f3.d6d6,4,,2016-02-23 20:33:07 CET,,0,PANIC,XX000,"concurrent transaction log activity while database system is shutting down",,,,,,,,,""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,5,,2016-02-23 20:33:07 CET,,0,LOG,00000,"checkpointer process (PID 54998) was terminated by signal 6: Abort trap",,,,,,,,,""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,6,,2016-02-23 20:33:07 CET,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,

Basically, the issue is that CreateCheckPoint calls CheckpointReplicationSlots, which currently produces WAL, and this violates the assumption at line xlog.c:8492

if (shutdown && checkPoint.redo != ProcLastRecPtr)
ereport(PANIC,
(errmsg("concurrent transaction log activity while database system is shutting down")));

There are a couple of incorrect comments

logical.c:90
"There's some things missing to allow this:" should be "There are some things missing to allow this:"

logical.c:93
"we need we would need”

slot.c:889
"and there's no latch to set, so poll” - clearly there is a latch used in the code below.

Also, slot.c:301 emits an error message for an attempt to create a failover slot on the replica after acquiring and releasing the locks and getting the shared memory slot, even though all the data to check for this condition is available right at the beginning of the function. Shouldn’t it avoid the extra work if it’s not needed?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
<0001-Allow-logical-slots-to-follow-timeline-switches.patch><0002-Allow-replication-slots-to-follow-failover.patch><0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patch><0004-Add-the-UI-and-for-failover-slots.patch><0005-Document-failover-slots.patch><0006-Add-failover-to-pg_replication_slots.patch><0007-not-for-inclusion-Test-script-for-failover-slots.patch>

Kind regards,
--
Oleksii

#21Craig Ringer
craig@2ndquadrant.com
In reply to: Oleksii Kliukin (#20)
Re: WIP: Failover Slots

On 24 February 2016 at 03:53, Oleksii Kliukin <alexk@hintbits.com> wrote:

I found the following issue when shutting down a master with a connected
replica that uses a physical failover slot:

2016-02-23 20:33:42.546 CET,,,54998,,56ccb3f3.d6d6,3,,2016-02-23 20:33:07
CET,,0,DEBUG,00000,"performing replication slot checkpoint",,,,,,,,,""
2016-02-23 20:33:42.594 CET,,,55002,,56ccb3f3.d6da,4,,2016-02-23 20:33:07
CET,,0,DEBUG,00000,"archived transaction log file
""000000010000000000000003""",,,,,,,,,""
2016-02-23 20:33:42.601 CET,,,54998,,56ccb3f3.d6d6,4,,2016-02-23 20:33:07
CET,,0,PANIC,XX000,"concurrent transaction log activity while database
system is shutting down",,,,,,,,,""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,5,,2016-02-23 20:33:07
CET,,0,LOG,00000,"checkpointer process (PID 54998) was terminated by signal
6: Abort trap",,,,,,,,,""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,6,,2016-02-23 20:33:07
CET,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,

Odd that I didn't see that in my testing. Thanks very much for this. I
concur with your explanation.

Basically, the issue is that CreateCheckPoint calls

CheckpointReplicationSlots, which currently produces WAL, and this violates
the assumption at line xlog.c:8492

if (shutdown && checkPoint.redo != ProcLastRecPtr)
ereport(PANIC,
(errmsg("concurrent transaction log activity while database system is
shutting down")));

Interesting problem.

It might be reasonably harmless to omit writing WAL for failover slots
during a shutdown checkpoint. We're using WAL to move slot state to the
replicas, but we don't really need it for local redo and correctness on the
master. The trouble is that we do of course redo failover slot updates on
the master, and we don't really want a slot to go backwards relative to its
on-disk state from before a crash. That's not too harmful in itself, but it
could cause us to lose a catalog_xmin increase, so the slot would think
catalog tuples are still readable that could actually have been vacuumed
away.

CheckpointReplicationSlots notes that:

* This needn't actually be part of a checkpoint, but it's a convenient
* location.

... and I suspect the answer there is simply to move the slot checkpoint to
occur prior to the WAL checkpoint rather than during it. I'll investigate.

I really want to focus on the first patch, timeline following for logical
slots. That part is much less invasive and is useful stand-alone. I'll move
it to a separate CF entry and post it to a separate thread as I think it
needs consideration independently of failover slots.

(BTW, the slot docs promise that slots will replay a change exactly once,
but this is not correct and the client must keep track of replay position.
I'll post a patch to correct it separately).

There are a couple of incorrect comments

Thanks, will amend.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#22Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#21)
Re: WIP: Failover Slots

On 24 February 2016 at 18:02, Craig Ringer <craig@2ndquadrant.com> wrote:

I really want to focus on the first patch, timeline following for logical
slots. That part is much less invasive and is useful stand-alone. I'll move
it to a separate CF entry and post it to a separate thread as I think it
needs consideration independently of failover slots.

Just an update on the failover slots status: I've moved timeline following
for logical slots into its own patch set and CF entry and added a bunch of
tests.

https://commitfest.postgresql.org/9/488/

Some Perl TAP test framework enhancements were needed for that; they're
mostly committed now, with a few pending.

https://commitfest.postgresql.org/9/569/

Once some final changes are made to the tests for timeline following I'll
address the checkpoint issue in failover slots by doing the checkpoint of
slots at the start of a checkpoint/restartpoint, while we can still write
WAL. Per the comments in CheckPointReplicationSlots it's mostly done in a
checkpoint currently for convenience.

Then I'll write some TAP tests for failover slots and submit an updated
patch for them, by which time hopefully timeline following for logical
slots will be committed.

In other words, this patch isn't dead; the foundations are just being
rebased out from under it.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#23Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#22)
7 attachment(s)
Re: WIP: Failover Slots

Here's a new failover slots rev, addressing the issues Oleksii Kliukin
raised and adding a bunch of TAP tests.

In particular, for the checkpoint issue I ended up moving
CheckPointReplicationSlots to occur at the start of a checkpoint, before
writing WAL is prohibited. As the comments note, it's just a convenient
place and time to do it anyway. That means it has to be called separately
at a restartpoint, but I don't think that's a biggie.

The tests for this took me quite a while, much (much) longer than the code
changes.

I split the patch up a bit more too so individual changes are more
logically grouped and clearer. I expect it'd be mostly or entirely squashed
for commit.

Attachments:

0001-Allow-replication-slots-to-follow-failover.patchtext/x-patch; charset=US-ASCII; name=0001-Allow-replication-slots-to-follow-failover.patchDownload
From 256d43f4c8195c893efeb0319d7642853d15f3a9 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 23 Feb 2016 15:59:37 +0800
Subject: [PATCH 1/7] Allow replication slots to follow failover

Originally replication slots were unique to a single node and weren't
recorded in WAL or replicated. A logical decoding client couldn't follow
a physical standby failover and promotion because the promoted replica
didn't have the original master's slots. The replica may not have
retained all required WAL and there was no way to create a new logical
slot and rewind it back to the point the logical client had replayed to.

Failover slots lift this limitation by replicating slots consistently to
physical standbys, keeping them up to date and using them in WAL
retention calculations. This allows a logical decoding client to follow
a physical failover and promotion without losing its place in the change
stream.

A failover slot may only be created on a master server, as it must be
able to write WAL. This limitation may be lifted later.

pg_basebackup is also modified to copy the contents of pg_replslot.
Non-failover slots will now be removed during backend startup instead
of being omitted from the copy.

This patch does not add any user interface for failover slots. There's
no way to create them from SQL or from the walsender. That and the
documentation for failover slots are in the next patch in the series
so that this patch is entirely focused on the implementation.

Craig Ringer, based on a prototype by Simon Riggs
---
 src/backend/access/rmgrdesc/Makefile               |   2 +-
 src/backend/access/rmgrdesc/replslotdesc.c         |  65 +++
 src/backend/access/transam/rmgr.c                  |   1 +
 src/backend/access/transam/xlog.c                  |   5 +-
 src/backend/commands/dbcommands.c                  |   3 +
 src/backend/replication/basebackup.c               |  12 -
 src/backend/replication/logical/decode.c           |   1 +
 src/backend/replication/logical/logical.c          |  25 +-
 src/backend/replication/slot.c                     | 586 +++++++++++++++++++--
 src/backend/replication/slotfuncs.c                |   4 +-
 src/backend/replication/walsender.c                |   8 +-
 src/bin/pg_xlogdump/replslotdesc.c                 |   1 +
 src/bin/pg_xlogdump/rmgrdesc.c                     |   1 +
 src/include/access/rmgrlist.h                      |   1 +
 src/include/replication/slot.h                     |  69 +--
 src/include/replication/slot_xlog.h                | 100 ++++
 .../modules/decoding_failover/decoding_failover.c  |   6 +-
 17 files changed, 758 insertions(+), 132 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/replslotdesc.c
 create mode 120000 src/bin/pg_xlogdump/replslotdesc.c
 create mode 100644 src/include/replication/slot_xlog.h

diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..600b544 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -10,7 +10,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
 	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
-	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
+	   replorigindesc.o replslotdesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/replslotdesc.c b/src/backend/access/rmgrdesc/replslotdesc.c
new file mode 100644
index 0000000..5829e8d
--- /dev/null
+++ b/src/backend/access/rmgrdesc/replslotdesc.c
@@ -0,0 +1,65 @@
+/*-------------------------------------------------------------------------
+ *
+ * replslotdesc.c
+ *	  rmgr descriptor routines for replication/slot.c
+ *
+ * Portions Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/replslotdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "replication/slot_xlog.h"
+
+void
+replslot_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			{
+				ReplicationSlotInWAL xlrec;
+
+				xlrec = (ReplicationSlotInWAL) rec;
+
+				appendStringInfo(buf, "of slot %s with restart %X/%X and xid %u confirmed to %X/%X",
+						NameStr(xlrec->name),
+						(uint32)(xlrec->restart_lsn>>32), (uint32)(xlrec->restart_lsn),
+						xlrec->xmin,
+						(uint32)(xlrec->confirmed_flush>>32), (uint32)(xlrec->confirmed_flush));
+
+				break;
+			}
+		case XLOG_REPLSLOT_DROP:
+			{
+				xl_replslot_drop *xlrec;
+
+				xlrec = (xl_replslot_drop *) rec;
+
+				appendStringInfo(buf, "of slot %s", NameStr(xlrec->name));
+
+				break;
+			}
+	}
+}
+
+const char *
+replslot_identify(uint8 info)
+{
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			return "UPDATE";
+		case XLOG_REPLSLOT_DROP:
+			return "DROP";
+		default:
+			return NULL;
+	}
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7c4d773..0bd5796 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -24,6 +24,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 94b79ac..a92f09d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6366,8 +6366,11 @@ StartupXLOG(void)
 	/*
 	 * Initialize replication slots, before there's a chance to remove
 	 * required resources.
+	 *
+	 * If we're in archive recovery then non-failover slots are no
+	 * longer of any use and should be dropped during startup.
 	 */
-	StartupReplicationSlots();
+	StartupReplicationSlots(ArchiveRecoveryRequested);
 
 	/*
 	 * Startup logical state, needs to be setup now so we have proper data
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c1c0223..61fc45b 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,9 @@ dbase_redo(XLogReaderState *record)
 		/* Clean out the xlog relcache too */
 		XLogDropDatabase(xlrec->db_id);
 
+		/* Drop any logical failover slots for this database */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* And remove the physical files */
 		if (!rmtree(dst_path, true))
 			ereport(WARNING,
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index af0fb09..ab1f271 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -973,18 +973,6 @@ sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces,
 		}
 
 		/*
-		 * Skip pg_replslot, not useful to copy. But include it as an empty
-		 * directory anyway, so we get permissions right.
-		 */
-		if (strcmp(de->d_name, "pg_replslot") == 0)
-		{
-			if (!sizeonly)
-				_tarWriteHeader(pathbuf + basepathlen + 1, NULL, &statbuf);
-			size += 512;		/* Size of the header just added */
-			continue;
-		}
-
-		/*
 		 * We can skip pg_xlog, the WAL segments need to be fetched from the
 		 * WAL archive anyway. But include it as an empty directory anyway, so
 		 * we get permissions right.
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 56be1ed..948e31f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -135,6 +135,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
+		case RM_REPLSLOT_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2e6d3f9..2c7b749 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -85,16 +85,19 @@ CheckLogicalDecodingRequirements(void)
 				 errmsg("logical decoding requires a database connection")));
 
 	/* ----
-	 * TODO: We got to change that someday soon...
+	 * TODO: Allow logical decoding from a standby
 	 *
-	 * There's basically three things missing to allow this:
+	 * There are some things missing to allow this:
 	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
+	 *    LSN belongs to
+	 * 2) To prevent needed rows from being removed we would need
+	 *    to enhance hot_standby_feedback so it sends both xmin and
+	 *    catalog_xmin to the master.  A standby slot can't write WAL, so we
+	 *    wouldn't be able to use it directly for failover, without some very
+	 *    complex state interactions via master.
+	 *
+	 * So this doesn't seem likely to change anytime soon.
+	 *
 	 * ----
 	 */
 	if (RecoveryInProgress())
@@ -272,7 +275,7 @@ CreateInitDecodingContext(char *plugin,
 	slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
 	slot->data.catalog_xmin = slot->effective_catalog_xmin;
 
-	ReplicationSlotsComputeRequiredXmin(true);
+	ReplicationSlotsUpdateRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
 
@@ -908,8 +911,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 			MyReplicationSlot->effective_catalog_xmin = MyReplicationSlot->data.catalog_xmin;
 			SpinLockRelease(&MyReplicationSlot->mutex);
 
-			ReplicationSlotsComputeRequiredXmin(false);
-			ReplicationSlotsComputeRequiredLSN();
+			ReplicationSlotsUpdateRequiredXmin(false);
+			ReplicationSlotsUpdateRequiredLSN();
 		}
 	}
 	else
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index affa9b9..d83118d 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -24,7 +24,18 @@
  * directory. Inside that directory the state file will contain the slot's
  * own data. Additional data can be stored alongside that file if required.
  * While the server is running, the state data is also cached in memory for
- * efficiency.
+ * efficiency. Non-failover slots are NOT subject to WAL logging and may
+ * be used on standbys (though that's only supported for physical slots at
+ * the moment). They use tempfile writes and swaps for crash safety.
+ *
+ * A failover slot created on a master node generates WAL records that
+ * maintain a copy of the slot on standby nodes. If a standby node is
+ * promoted the failover slot allows access to be restarted just as if
+ * the original master node was being accessed, allowing for the timeline
+ * change. The replica considers slot positions when removing WAL to make
+ * sure it can satisfy the needs of slots after promotion.  For logical
+ * decoding slots the slot's internal state is kept up to date so it's
+ * ready for use after promotion.
  *
  * ReplicationSlotAllocationLock must be taken in exclusive mode to allocate
  * or free a slot. ReplicationSlotControlLock must be taken in shared mode
@@ -44,6 +55,7 @@
 #include "common/string.h"
 #include "miscadmin.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -101,10 +113,14 @@ static LWLockTranche ReplSlotIOLWLockTranche;
 static void ReplicationSlotDropAcquired(void);
 
 /* internal persistency functions */
-static void RestoreSlotFromDisk(const char *name);
+static void RestoreSlotFromDisk(const char *name, bool drop_nonfailover_slots);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
 static void SaveSlotToPath(ReplicationSlot *slot, const char *path, int elevel);
 
+/* internal redo functions */
+static void ReplicationSlotRedoCreateOrUpdate(ReplicationSlotInWAL xlrec);
+static void ReplicationSlotRedoDrop(const char * slotname);
+
 /*
  * Report shared-memory space needed by ReplicationSlotShmemInit.
  */
@@ -220,7 +236,8 @@ ReplicationSlotValidateName(const char *name, int elevel)
  */
 void
 ReplicationSlotCreate(const char *name, bool db_specific,
-					  ReplicationSlotPersistency persistency)
+					  ReplicationSlotPersistency persistency,
+					  bool failover)
 {
 	ReplicationSlot *slot = NULL;
 	int			i;
@@ -229,6 +246,11 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 
 	ReplicationSlotValidateName(name, ERROR);
 
+	if (failover && RecoveryInProgress())
+		ereport(ERROR,
+				(errmsg("a failover slot may not be created on a replica"),
+				 errhint("Create the slot on the master server instead")));
+
 	/*
 	 * If some other backend ran this code concurrently with us, we'd likely both
 	 * allocate the same slot, and that would be bad.  We'd also be at risk of
@@ -278,6 +300,9 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	StrNCpy(NameStr(slot->data.name), name, NAMEDATALEN);
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.restart_lsn = InvalidXLogRecPtr;
+	/* Slot timeline is unused and always zero */
+	slot->data.restart_tli = 0;
+	slot->data.failover = failover;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -313,6 +338,10 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 
 /*
  * Find a previously created slot and mark it as used by this backend.
+ *
+ * Sets active_pid and assigns MyReplicationSlot iff successfully acquired.
+ *
+ * ERRORs on an attempt to acquire a failover slot when in recovery.
  */
 void
 ReplicationSlotAcquire(const char *name)
@@ -335,7 +364,11 @@ ReplicationSlotAcquire(const char *name)
 		{
 			SpinLockAcquire(&s->mutex);
 			active_pid = s->active_pid;
-			if (active_pid == 0)
+			/*
+			 * We can only claim a slot for our use if it's not claimed
+			 * by someone else AND it isn't a failover slot on a standby.
+			 */
+			if (active_pid == 0 && !(RecoveryInProgress() && s->data.failover))
 				s->active_pid = MyProcPid;
 			SpinLockRelease(&s->mutex);
 			slot = s;
@@ -349,12 +382,24 @@ ReplicationSlotAcquire(const char *name)
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_OBJECT),
 				 errmsg("replication slot \"%s\" does not exist", name)));
+
 	if (active_pid != 0)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_IN_USE),
 			   errmsg("replication slot \"%s\" is active for PID %d",
 					  name, active_pid)));
 
+	/*
+	 * An attempt to use a failover slot from a standby must fail since
+	 * we can't write WAL from a standby and there's no sensible way
+	 * to advance slot position from both replica and master anyway.
+	 */
+	if (RecoveryInProgress() && slot->data.failover)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_IN_USE),
+				 errmsg("replication slot \"%s\" is reserved for use after failover",
+					  name)));
+
 	/* We made this slot active, so it's ours now. */
 	MyReplicationSlot = slot;
 }
@@ -411,6 +456,9 @@ ReplicationSlotDrop(const char *name)
 /*
  * Permanently drop the currently acquired replication slot which will be
  * released by the point this function returns.
+ *
+ * Callers must NOT hold ReplicationSlotControlLock in SHARED mode.  EXCLUSIVE
+ * is OK, or not held at all.
  */
 static void
 ReplicationSlotDropAcquired(void)
@@ -418,9 +466,14 @@ ReplicationSlotDropAcquired(void)
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
 	ReplicationSlot *slot = MyReplicationSlot;
+	bool slot_is_failover;
+	bool took_control_lock = false,
+		 took_allocation_lock = false;
 
 	Assert(MyReplicationSlot != NULL);
 
+	slot_is_failover = slot->data.failover;
+
 	/* slot isn't acquired anymore */
 	MyReplicationSlot = NULL;
 
@@ -428,8 +481,27 @@ ReplicationSlotDropAcquired(void)
 	 * If some other backend ran this code concurrently with us, we might try
 	 * to delete a slot with a certain name while someone else was trying to
 	 * create a slot with the same name.
+	 *
+	 * If called with the lock already held it MUST be held in
+	 * EXCLUSIVE mode.
 	 */
-	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+	if (!LWLockHeldByMe(ReplicationSlotAllocationLock))
+	{
+		took_allocation_lock = true;
+		LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+	}
+
+	/* Record the drop in XLOG if we aren't replaying WAL */
+	if (XLogInsertAllowed() && slot_is_failover)
+	{
+		xl_replslot_drop xlrec;
+
+		memcpy(&(xlrec.name), NameStr(slot->data.name), NAMEDATALEN);
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xlrec));
+		(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_DROP);
+	}
 
 	/* Generate pathnames. */
 	sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
@@ -459,7 +531,11 @@ ReplicationSlotDropAcquired(void)
 	}
 	else
 	{
-		bool		fail_softly = slot->data.persistency == RS_EPHEMERAL;
+		bool		fail_softly = false;
+
+		if (RecoveryInProgress() ||
+			slot->data.persistency == RS_EPHEMERAL)
+			fail_softly = true;
 
 		SpinLockAcquire(&slot->mutex);
 		slot->active_pid = 0;
@@ -477,18 +553,27 @@ ReplicationSlotDropAcquired(void)
 	 * grabbing the mutex because nobody else can be scanning the array here,
 	 * and nobody can be attached to this slot and thus access it without
 	 * scanning the array.
+	 *
+	 * You must hold the lock in EXCLUSIVE mode or not at all.
 	 */
-	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	if (!LWLockHeldByMe(ReplicationSlotControlLock))
+	{
+		took_control_lock = true;
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	}
+
 	slot->active_pid = 0;
 	slot->in_use = false;
-	LWLockRelease(ReplicationSlotControlLock);
+
+	if (took_control_lock)
+		LWLockRelease(ReplicationSlotControlLock);
 
 	/*
 	 * Slot is dead and doesn't prevent resource removal anymore, recompute
 	 * limits.
 	 */
-	ReplicationSlotsComputeRequiredXmin(false);
-	ReplicationSlotsComputeRequiredLSN();
+	ReplicationSlotsUpdateRequiredXmin(false);
+	ReplicationSlotsUpdateRequiredLSN();
 
 	/*
 	 * If removing the directory fails, the worst thing that will happen is
@@ -504,7 +589,8 @@ ReplicationSlotDropAcquired(void)
 	 * We release this at the very end, so that nobody starts trying to create
 	 * a slot while we're still cleaning up the detritus of the old one.
 	 */
-	LWLockRelease(ReplicationSlotAllocationLock);
+	if (took_allocation_lock)
+		LWLockRelease(ReplicationSlotAllocationLock);
 }
 
 /*
@@ -544,6 +630,9 @@ ReplicationSlotMarkDirty(void)
 /*
  * Convert a slot that's marked as RS_EPHEMERAL to a RS_PERSISTENT slot,
  * guaranteeing it will be there after an eventual crash.
+ *
+ * Failover slots will emit a create xlog record at this time, having
+ * not been previously written to xlog.
  */
 void
 ReplicationSlotPersist(void)
@@ -565,7 +654,7 @@ ReplicationSlotPersist(void)
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  */
 void
-ReplicationSlotsComputeRequiredXmin(bool already_locked)
+ReplicationSlotsUpdateRequiredXmin(bool already_locked)
 {
 	int			i;
 	TransactionId agg_xmin = InvalidTransactionId;
@@ -610,10 +699,20 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 }
 
 /*
- * Compute the oldest restart LSN across all slots and inform xlog module.
+ * Update the xlog module's copy of the minimum restart lsn across all slots
  */
 void
-ReplicationSlotsComputeRequiredLSN(void)
+ReplicationSlotsUpdateRequiredLSN(void)
+{
+	XLogSetReplicationSlotMinimumLSN(ReplicationSlotsComputeRequiredLSN(false));
+}
+
+/*
+ * Compute the oldest restart LSN across all slots (or optionally
+ * only failover slots) and return it.
+ */
+XLogRecPtr
+ReplicationSlotsComputeRequiredLSN(bool failover_only)
 {
 	int			i;
 	XLogRecPtr	min_required = InvalidXLogRecPtr;
@@ -625,14 +724,19 @@ ReplicationSlotsComputeRequiredLSN(void)
 	{
 		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
 		XLogRecPtr	restart_lsn;
+		bool		failover;
 
 		if (!s->in_use)
 			continue;
 
 		SpinLockAcquire(&s->mutex);
 		restart_lsn = s->data.restart_lsn;
+		failover = s->data.failover;
 		SpinLockRelease(&s->mutex);
 
+		if (failover_only && !failover)
+			continue;
+
 		if (restart_lsn != InvalidXLogRecPtr &&
 			(min_required == InvalidXLogRecPtr ||
 			 restart_lsn < min_required))
@@ -640,7 +744,7 @@ ReplicationSlotsComputeRequiredLSN(void)
 	}
 	LWLockRelease(ReplicationSlotControlLock);
 
-	XLogSetReplicationSlotMinimumLSN(min_required);
+	return min_required;
 }
 
 /*
@@ -649,7 +753,7 @@ ReplicationSlotsComputeRequiredLSN(void)
  * Returns InvalidXLogRecPtr if logical decoding is disabled or no logical
  * slots exist.
  *
- * NB: this returns a value >= ReplicationSlotsComputeRequiredLSN(), since it
+ * NB: this returns a value >= ReplicationSlotsUpdateRequiredLSN(), since it
  * ignores physical replication slots.
  *
  * The results aren't required frequently, so we don't maintain a precomputed
@@ -747,6 +851,45 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	Assert(MyReplicationSlot == NULL);
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		if (s->data.database == dboid)
+		{
+			/*
+			 * There should be no connections to this dbid
+			 * therefore all slots for this dbid should be
+			 * logical, inactive failover slots.
+			 */
+			Assert(s->active_pid == 0);
+			Assert(s->in_use == false);
+			Assert(SlotIsLogical(s));
+
+			/*
+			 * Acquire the replication slot
+			 */
+			MyReplicationSlot = s;
+
+			/*
+			 * No need to deactivate slot, especially since we
+			 * already hold ReplicationSlotControlLock.
+			 */
+			ReplicationSlotDropAcquired();
+		}
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	MyReplicationSlot = NULL;
+}
 
 /*
  * Check whether the server's configuration supports using replication
@@ -779,12 +922,13 @@ ReplicationSlotReserveWal(void)
 
 	Assert(slot != NULL);
 	Assert(slot->data.restart_lsn == InvalidXLogRecPtr);
+	Assert(slot->data.restart_tli == 0);
 
 	/*
 	 * The replication slot mechanism is used to prevent removal of required
 	 * WAL. As there is no interlock between this routine and checkpoints, WAL
 	 * segments could concurrently be removed when a now stale return value of
-	 * ReplicationSlotsComputeRequiredLSN() is used. In the unlikely case that
+	 * ReplicationSlotsUpdateRequiredLSN() is used. In the unlikely case that
 	 * this happens we'll just retry.
 	 */
 	while (true)
@@ -821,12 +965,12 @@ ReplicationSlotReserveWal(void)
 		}
 
 		/* prevent WAL removal as fast as possible */
-		ReplicationSlotsComputeRequiredLSN();
+		ReplicationSlotsUpdateRequiredLSN();
 
 		/*
 		 * If all required WAL is still there, great, otherwise retry. The
 		 * slot should prevent further removal of WAL, unless there's a
-		 * concurrent ReplicationSlotsComputeRequiredLSN() after we've written
+		 * concurrent ReplicationSlotsUpdateRequiredLSN() after we've written
 		 * the new restart_lsn above, so normally we should never need to loop
 		 * more than twice.
 		 */
@@ -878,7 +1022,7 @@ CheckPointReplicationSlots(void)
  * needs to be run before we start crash recovery.
  */
 void
-StartupReplicationSlots(void)
+StartupReplicationSlots(bool drop_nonfailover_slots)
 {
 	DIR		   *replication_dir;
 	struct dirent *replication_de;
@@ -917,7 +1061,7 @@ StartupReplicationSlots(void)
 		}
 
 		/* looks like a slot in a normal state, restore */
-		RestoreSlotFromDisk(replication_de->d_name);
+		RestoreSlotFromDisk(replication_de->d_name, drop_nonfailover_slots);
 	}
 	FreeDir(replication_dir);
 
@@ -926,8 +1070,8 @@ StartupReplicationSlots(void)
 		return;
 
 	/* Now that we have recovered all the data, compute replication xmin */
-	ReplicationSlotsComputeRequiredXmin(false);
-	ReplicationSlotsComputeRequiredLSN();
+	ReplicationSlotsUpdateRequiredXmin(false);
+	ReplicationSlotsUpdateRequiredLSN();
 }
 
 /* ----
@@ -996,6 +1140,8 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 
 /*
  * Shared functionality between saving and creating a replication slot.
+ *
+ * For failover slots this is where we emit xlog.
  */
 static void
 SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
@@ -1006,15 +1152,18 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	ReplicationSlotOnDisk cp;
 	bool		was_dirty;
 
-	/* first check whether there's something to write out */
-	SpinLockAcquire(&slot->mutex);
-	was_dirty = slot->dirty;
-	slot->just_dirtied = false;
-	SpinLockRelease(&slot->mutex);
+	if (!RecoveryInProgress())
+	{
+		/* first check whether there's something to write out */
+		SpinLockAcquire(&slot->mutex);
+		was_dirty = slot->dirty;
+		slot->just_dirtied = false;
+		SpinLockRelease(&slot->mutex);
 
-	/* and don't do anything if there's nothing to write */
-	if (!was_dirty)
-		return;
+		/* and don't do anything if there's nothing to write */
+		if (!was_dirty)
+			return;
+	}
 
 	LWLockAcquire(&slot->io_in_progress_lock, LW_EXCLUSIVE);
 
@@ -1047,6 +1196,25 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	SpinLockRelease(&slot->mutex);
 
+	/*
+	 * If needed, record this action in WAL
+	 */
+	if (slot->data.failover &&
+		slot->data.persistency == RS_PERSISTENT &&
+		!RecoveryInProgress())
+	{
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&cp.slotdata), sizeof(ReplicationSlotPersistentData));
+		/*
+		 * Note that slot creation on the downstream is also an "update".
+		 *
+		 * Slots can start off ephemeral and be updated to persistent. We just
+		 * log the update and the downstream creates the new slot if it doesn't
+		 * exist yet.
+		 */
+		(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_UPDATE);
+	}
+
 	COMP_CRC32C(cp.checksum,
 				(char *) (&cp) + SnapBuildOnDiskNotChecksummedSize,
 				SnapBuildOnDiskChecksummedSize);
@@ -1116,7 +1284,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
  * Load a single slot from disk into memory.
  */
 static void
-RestoreSlotFromDisk(const char *name)
+RestoreSlotFromDisk(const char *name, bool drop_nonfailover_slots)
 {
 	ReplicationSlotOnDisk cp;
 	int			i;
@@ -1235,10 +1403,21 @@ RestoreSlotFromDisk(const char *name)
 						path, checksum, cp.checksum)));
 
 	/*
-	 * If we crashed with an ephemeral slot active, don't restore but delete
-	 * it.
+	 * If we crashed with an ephemeral slot active, don't restore but
+	 * delete it.
+	 *
+	 * Similarly, if we're in archive recovery and will be running as
+	 * a standby (when drop_nonfailover_slots is set), non-failover
+	 * slots can't be relied upon. Logical slots might have a catalog
+	 * xmin lower than reality because the original slot on the master
+	 * advanced past the point the stale slot on the replica is stuck
+	 * at. Additionally slots might have been copied while being
+	 * written to if the basebackup copy method was not atomic.
+	 * Failover slots are safe since they're WAL-logged and follow the
+	 * master's slot position.
 	 */
-	if (cp.slotdata.persistency != RS_PERSISTENT)
+	if (cp.slotdata.persistency != RS_PERSISTENT
+			|| (drop_nonfailover_slots && !cp.slotdata.failover))
 	{
 		sprintf(path, "pg_replslot/%s", name);
 
@@ -1249,6 +1428,14 @@ RestoreSlotFromDisk(const char *name)
 					 errmsg("could not remove directory \"%s\"", path)));
 		}
 		fsync_fname("pg_replslot", true);
+
+		if (cp.slotdata.persistency == RS_PERSISTENT)
+		{
+			ereport(LOG,
+					(errmsg("dropped non-failover slot \"%s\" during archive recovery",
+							 NameStr(cp.slotdata.name))));
+		}
+
 		return;
 	}
 
@@ -1285,5 +1472,332 @@ RestoreSlotFromDisk(const char *name)
 	if (!restored)
 		ereport(PANIC,
 				(errmsg("too many replication slots active before shutdown"),
-				 errhint("Increase max_replication_slots and try again.")));
+				 errhint("Increase max_replication_slots (currently %d) and try again.",
+					 max_replication_slots)));
+}
+
+/*
+ * This usually just writes new persistent data to the slot state, but an
+ * update record might create a new slot on the downstream if we changed a
+ * previously ephemeral slot to persistent. We decide which case applies
+ * by checking whether a slot with this name already exists.
+ */
+static void
+ReplicationSlotRedoCreateOrUpdate(ReplicationSlotInWAL xlrec)
+{
+	ReplicationSlot *slot;
+	bool	found_available = false;
+	bool	found_duplicate = false;
+	int		use_slotid = 0;
+	int		i;
+
+	/*
+	 * We're in redo, but someone could still create a local
+	 * non-failover slot and race with us unless we take the
+	 * allocation lock.
+	 */
+	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Find first unused position in the slots array, but keep on
+		 * scanning in case there's an existing slot with the same
+		 * name.
+		 */
+		if (!slot->in_use && !found_available)
+		{
+			use_slotid = i;
+			found_available = true;
+		}
+
+		/*
+		 * Existing slot with same name? It could be our failover slot
+		 * to update or a non-failover slot with a conflicting name.
+		 */
+		if (strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) == 0)
+		{
+			use_slotid = i;
+			found_available = true;
+			found_duplicate = true;
+			break;
+		}
+	}
+
+	if (found_duplicate && !slot->data.failover)
+	{
+		/*
+		 * A local non-failover slot exists with the same name as
+		 * the failover slot we're creating.
+		 *
+		 * Clobber the client, drop its slot, and carry on with
+		 * our business.
+		 *
+		 * First we must temporarily release the allocation lock while
+		 * we try to terminate the process that holds the slot, since
+		 * we don't want to hold the LWLock for ages. We'll reacquire
+		 * it later.
+		 */
+		LWLockRelease(ReplicationSlotAllocationLock);
+
+		/* We might race with other clients, so retry-loop */
+		do
+		{
+			int active_pid = slot->active_pid;
+			int max_sleep_millis = 120 * 1000;
+			int millis_per_sleep = 1000;
+
+			if (active_pid != 0)
+			{
+				ereport(INFO,
+						(errmsg("terminating active connection by pid %u to local slot \"%s\" because of conflict with recovery",
+							active_pid, NameStr(slot->data.name))));
+
+				if (kill(active_pid, SIGTERM))
+					elog(DEBUG1, "failed to signal pid %u to terminate on slot conflict: %m",
+							active_pid);
+
+				/*
+				 * Wait for the process using the slot to die. This just uses the
+				 * latch to poll; the process won't set our latch when it releases
+				 * the slot and dies.
+				 *
+				 * We're checking active_pid without any locks held, but we'll
+				 * recheck when we attempt to drop the slot.
+				 */
+				while (slot->in_use && slot->active_pid == active_pid
+						&& max_sleep_millis > 0)
+				{
+					int rc;
+
+					rc = WaitLatch(MyLatch,
+							WL_TIMEOUT | WL_LATCH_SET | WL_POSTMASTER_DEATH,
+							millis_per_sleep);
+
+					if (rc & WL_POSTMASTER_DEATH)
+						elog(FATAL, "exiting after postmaster termination");
+
+					/*
+					 * Might be shorter if something sets our latch, but
+					 * we don't care much.
+					 */
+					max_sleep_millis -= millis_per_sleep;
+				}
+
+				if (max_sleep_millis <= 0)
+					elog(WARNING, "process %u is taking too long to terminate after SIGTERM",
+							active_pid);
+			}
+
+			if (slot->active_pid == 0)
+			{
+				/* Try to acquire and drop the slot */
+				SpinLockAcquire(&slot->mutex);
+
+				if (slot->active_pid != 0)
+				{
+					/* Lost the race, go around */
+				}
+				else
+				{
+					/* Claim the slot for ourselves */
+					slot->active_pid = MyProcPid;
+					MyReplicationSlot = slot;
+				}
+				SpinLockRelease(&slot->mutex);
+			}
+
+			if (slot->active_pid == MyProcPid)
+			{
+				NameData slotname;
+				strncpy(NameStr(slotname), NameStr(slot->data.name), NAMEDATALEN);
+				(NameStr(slotname))[NAMEDATALEN-1] = '\0';
+
+				/*
+				 * Reclaim the allocation lock and THEN drop the slot,
+				 * so nobody else can grab the name until we've
+				 * finished redo.
+				 */
+				LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+				ReplicationSlotDropAcquired();
+				/* We clobbered the duplicate, treat it as new */
+				found_duplicate = false;
+
+				ereport(WARNING,
+						(errmsg("dropped local replication slot \"%s\" because of conflict with recovery",
+								NameStr(slotname)),
+						 errdetail("A failover slot with the same name was created on the master server")));
+			}
+		}
+		while (slot->in_use);
+	}
+
+	Assert(LWLockHeldByMe(ReplicationSlotAllocationLock));
+
+	/*
+	 * This is either an empty slot control position to make a new slot or it's
+	 * an existing entry for this failover slot that we need to update.
+	 */
+	if (found_available)
+	{
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+
+		slot = &ReplicationSlotCtl->replication_slots[use_slotid];
+
+		/* restore the entire set of persistent data */
+		memcpy(&slot->data, xlrec,
+			   sizeof(ReplicationSlotPersistentData));
+
+		Assert(strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) == 0);
+		Assert(slot->data.failover && slot->data.persistency == RS_PERSISTENT);
+
+		/* Update the non-persistent in-memory state */
+		slot->effective_xmin = xlrec->xmin;
+		slot->effective_catalog_xmin = xlrec->catalog_xmin;
+
+		if (found_duplicate)
+		{
+			char		path[MAXPGPATH];
+
+			/* Write an existing slot to disk */
+			Assert(slot->in_use);
+			Assert(slot->active_pid == 0); /* can't be replaying from failover slot */
+
+			sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
+			slot->dirty = true;
+			SaveSlotToPath(slot, path, ERROR);
+		}
+		else
+		{
+			Assert(!slot->in_use);
+			/* In-memory state that's only set on create, not update */
+			slot->active_pid = 0;
+			slot->in_use = true;
+			slot->candidate_catalog_xmin = InvalidTransactionId;
+			slot->candidate_xmin_lsn = InvalidXLogRecPtr;
+			slot->candidate_restart_lsn = InvalidXLogRecPtr;
+			slot->candidate_restart_valid = InvalidXLogRecPtr;
+
+			CreateSlotOnDisk(slot);
+		}
+
+		LWLockRelease(ReplicationSlotControlLock);
+
+		ReplicationSlotsUpdateRequiredXmin(false);
+		ReplicationSlotsUpdateRequiredLSN();
+	}
+
+	LWLockRelease(ReplicationSlotAllocationLock);
+
+	if (!found_available)
+	{
+		/*
+		 * The standby should have max_replication_slots at least as large
+		 * as the master's, so this shouldn't happen, but just in case...
+		 */
+		ereport(ERROR,
+				(errmsg("max_replication_slots exceeded, cannot replay failover slot creation"),
+				 errhint("Increase max_replication_slots.")));
+	}
+}
+
+/*
+ * Redo a slot drop of a failover slot. This might be a redo during crash
+ * recovery on the master or it may be replay on a standby.
+ */
+static void
+ReplicationSlotRedoDrop(const char * slotname)
+{
+	/*
+	 * Acquire the failover slot that's to be dropped.
+	 *
+	 * We can't ReplicationSlotAcquire here because we want to acquire
+	 * a replication slot during replay, which isn't usually allowed.
+	 * Also, because we might crash midway through a drop we can't
+	 * assume we'll actually find the slot so it's not an error for
+	 * the slot to be missing.
+	 */
+	int			i;
+
+	Assert(MyReplicationSlot == NULL);
+
+	ReplicationSlotValidateName(slotname, ERROR);
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		if (s->in_use && strcmp(slotname, NameStr(s->data.name)) == 0)
+		{
+			if (s->data.persistency != RS_PERSISTENT)
+			{
+				/* shouldn't happen */
+				elog(WARNING, "found conflicting non-persistent slot during failover slot drop");
+				break;
+			}
+
+			if (!s->data.failover)
+			{
+				/* shouldn't happen */
+				elog(WARNING, "found non-failover slot during redo of slot drop");
+				break;
+			}
+
+			/* A failover slot can't be active during recovery */
+			Assert(s->active_pid == 0);
+
+			/* Claim the slot */
+			s->active_pid = MyProcPid;
+			MyReplicationSlot = s;
+
+			break;
+		}
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	if (MyReplicationSlot != NULL)
+	{
+		ReplicationSlotDropAcquired();
+	}
+	else
+	{
+		elog(WARNING, "failover slot \"%s\" not found during redo of drop",
+				slotname);
+	}
+}
+
+void
+replslot_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		/*
+		 * Update the values for an existing failover slot or, when a slot
+		 * is first logged as persistent, create it on the downstream.
+		 */
+		case XLOG_REPLSLOT_UPDATE:
+			ReplicationSlotRedoCreateOrUpdate((ReplicationSlotInWAL) XLogRecGetData(record));
+			break;
+
+		/*
+		 * Drop an existing failover slot.
+		 */
+		case XLOG_REPLSLOT_DROP:
+			{
+				xl_replslot_drop *xlrec =
+				(xl_replslot_drop *) XLogRecGetData(record);
+
+				ReplicationSlotRedoDrop(NameStr(xlrec->name));
+
+				break;
+			}
+
+		default:
+			elog(PANIC, "replslot_redo: unknown op code %u", info);
+	}
 }
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 9cc24ea..f430714 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -57,7 +57,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 	CheckSlotRequirements();
 
 	/* acquire replication slot, this will check for conflicting names */
-	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT);
+	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT, false);
 
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	nulls[0] = false;
@@ -120,7 +120,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	 * errors during initialization because it'll get dropped if this
 	 * transaction fails. We'll make it persistent at the end.
 	 */
-	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL);
+	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL, false);
 
 	/*
 	 * Create logical decoding context, to build the initial snapshot.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c03e045..1583862 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -792,7 +792,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	if (cmd->kind == REPLICATION_KIND_PHYSICAL)
 	{
-		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT);
+		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT, false);
 	}
 	else
 	{
@@ -803,7 +803,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * handle errors during initialization because it'll get dropped if
 		 * this transaction fails. We'll make it persistent at the end.
 		 */
-		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL);
+		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL, false);
 	}
 
 	initStringInfo(&output_message);
@@ -1523,7 +1523,7 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	if (changed)
 	{
 		ReplicationSlotMarkDirty();
-		ReplicationSlotsComputeRequiredLSN();
+		ReplicationSlotsUpdateRequiredLSN();
 	}
 
 	/*
@@ -1619,7 +1619,7 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 	if (changed)
 	{
 		ReplicationSlotMarkDirty();
-		ReplicationSlotsComputeRequiredXmin(false);
+		ReplicationSlotsUpdateRequiredXmin(false);
 	}
 }
 
diff --git a/src/bin/pg_xlogdump/replslotdesc.c b/src/bin/pg_xlogdump/replslotdesc.c
new file mode 120000
index 0000000..2e088d2
--- /dev/null
+++ b/src/bin/pg_xlogdump/replslotdesc.c
@@ -0,0 +1 @@
+../../../src/backend/access/rmgrdesc/replslotdesc.c
\ No newline at end of file
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index f9cd395..73ed7d4 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -26,6 +26,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "rmgrdesc.h"
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index fab912d..124b7e5 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -45,3 +45,4 @@ PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_start
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
+PG_RMGR(RM_REPLSLOT_ID, "ReplicationSlot", replslot_redo, replslot_desc, replslot_identify, NULL, NULL)
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8be8ab6..cdcbd37 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -11,69 +11,12 @@
 
 #include "fmgr.h"
 #include "access/xlog.h"
-#include "access/xlogreader.h"
+#include "replication/slot_xlog.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
 
 /*
- * Behaviour of replication slots, upon release or crash.
- *
- * Slots marked as PERSISTENT are crashsafe and will not be dropped when
- * released. Slots marked as EPHEMERAL will be dropped when released or after
- * restarts.
- *
- * EPHEMERAL slots can be made PERSISTENT by calling ReplicationSlotPersist().
- */
-typedef enum ReplicationSlotPersistency
-{
-	RS_PERSISTENT,
-	RS_EPHEMERAL
-} ReplicationSlotPersistency;
-
-/*
- * On-Disk data of a replication slot, preserved across restarts.
- */
-typedef struct ReplicationSlotPersistentData
-{
-	/* The slot's identifier */
-	NameData	name;
-
-	/* database the slot is active on */
-	Oid			database;
-
-	/*
-	 * The slot's behaviour when being dropped (or restored after a crash).
-	 */
-	ReplicationSlotPersistency persistency;
-
-	/*
-	 * xmin horizon for data
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId xmin;
-
-	/*
-	 * xmin horizon for catalog tuples
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId catalog_xmin;
-
-	/* oldest LSN that might be required by this replication slot */
-	XLogRecPtr	restart_lsn;
-
-	/* oldest LSN that the client has acked receipt for */
-	XLogRecPtr	confirmed_flush;
-
-	/* plugin name */
-	NameData	plugin;
-} ReplicationSlotPersistentData;
-
-/*
  * Shared memory state of a single replication slot.
  */
 typedef struct ReplicationSlot
@@ -155,7 +98,7 @@ extern void ReplicationSlotsShmemInit(void);
 
 /* management of individual slots */
 extern void ReplicationSlotCreate(const char *name, bool db_specific,
-					  ReplicationSlotPersistency p);
+					  ReplicationSlotPersistency p, bool failover);
 extern void ReplicationSlotPersist(void);
 extern void ReplicationSlotDrop(const char *name);
 
@@ -167,12 +110,14 @@ extern void ReplicationSlotMarkDirty(void);
 /* misc stuff */
 extern bool ReplicationSlotValidateName(const char *name, int elevel);
 extern void ReplicationSlotReserveWal(void);
-extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
-extern void ReplicationSlotsComputeRequiredLSN(void);
+extern void ReplicationSlotsUpdateRequiredXmin(bool already_locked);
+extern void ReplicationSlotsUpdateRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
+extern XLogRecPtr ReplicationSlotsComputeRequiredLSN(bool failover_only);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
-extern void StartupReplicationSlots(void);
+extern void StartupReplicationSlots(bool drop_nonfailover_slots);
 extern void CheckPointReplicationSlots(void);
 
 extern void CheckSlotRequirements(void);
diff --git a/src/include/replication/slot_xlog.h b/src/include/replication/slot_xlog.h
new file mode 100644
index 0000000..e3211f5
--- /dev/null
+++ b/src/include/replication/slot_xlog.h
@@ -0,0 +1,100 @@
+/*-------------------------------------------------------------------------
+ *	   WAL-logging definitions for replication slots.
+ *	   Replication slot management.
+ *
+ * Copyright (c) 2012-2015, PostgreSQL Global Development Group
+ *
+ * src/include/replication/slot_xlog.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef SLOT_XLOG_H
+#define SLOT_XLOG_H
+
+#include "fmgr.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogreader.h"
+
+/*
+ * Behaviour of replication slots, upon release or crash.
+ *
+ * Slots marked as PERSISTENT are crashsafe and will not be dropped when
+ * released. Slots marked as EPHEMERAL will be dropped when released or after
+ * restarts.
+ *
+ * EPHEMERAL slots can be made PERSISTENT by calling ReplicationSlotPersist().
+ */
+typedef enum ReplicationSlotPersistency
+{
+	RS_PERSISTENT,
+	RS_EPHEMERAL
+} ReplicationSlotPersistency;
+
+/*
+ * On-Disk data of a replication slot, preserved across restarts.
+ */
+typedef struct ReplicationSlotPersistentData
+{
+	/* The slot's identifier */
+	NameData	name;
+
+	/* database the slot is active on */
+	Oid			database;
+
+	/*
+	 * The slot's behaviour when being dropped (or restored after a crash).
+	 */
+	ReplicationSlotPersistency persistency;
+
+	/*
+	 * If set, the slot is WAL-logged, so a copy is maintained on all
+	 * standbys, where it only becomes usable after promotion.
+	 */
+	bool		failover;
+
+	/*
+	 * xmin horizon for data
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId xmin;
+
+	/*
+	 * xmin horizon for catalog tuples
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId catalog_xmin;
+
+	/* oldest LSN that might be required by this replication slot */
+	XLogRecPtr	restart_lsn;
+	TimeLineID	restart_tli;
+
+	/* oldest LSN that the client has acked receipt for */
+	XLogRecPtr	confirmed_flush;
+
+	/* plugin name */
+	NameData	plugin;
+} ReplicationSlotPersistentData;
+
+typedef ReplicationSlotPersistentData *ReplicationSlotInWAL;
+
+/*
+ * WAL records for failover slots
+ */
+#define XLOG_REPLSLOT_UPDATE	0x10
+#define XLOG_REPLSLOT_DROP		0x20
+
+typedef struct xl_replslot_drop
+{
+	NameData	name;
+} xl_replslot_drop;
+
+/* WAL logging */
+extern void replslot_redo(XLogReaderState *record);
+extern void replslot_desc(StringInfo buf, XLogReaderState *record);
+extern const char *replslot_identify(uint8 info);
+
+#endif   /* SLOT_XLOG_H */
diff --git a/src/test/modules/decoding_failover/decoding_failover.c b/src/test/modules/decoding_failover/decoding_failover.c
index bab0f3b..8fcfda5 100644
--- a/src/test/modules/decoding_failover/decoding_failover.c
+++ b/src/test/modules/decoding_failover/decoding_failover.c
@@ -37,7 +37,7 @@ decoding_failover_create_logical_slot(PG_FUNCTION_ARGS)
 
 	CheckSlotRequirements();
 
-	ReplicationSlotCreate(slotname, true, RS_PERSISTENT);
+	ReplicationSlotCreate(slotname, true, RS_PERSISTENT, false);
 
 	/* register the plugin name with the slot */
 	StrNCpy(NameStr(MyReplicationSlot->data.plugin), plugin, NAMEDATALEN);
@@ -99,8 +99,8 @@ decoding_failover_advance_logical_slot(PG_FUNCTION_ARGS)
 	ReplicationSlotSave();
 	ReplicationSlotRelease();
 
-	ReplicationSlotsComputeRequiredXmin(false);
-	ReplicationSlotsComputeRequiredLSN();
+	ReplicationSlotsUpdateRequiredXmin(false);
+	ReplicationSlotsUpdateRequiredLSN();
 
 	PG_RETURN_VOID();
 }
-- 
2.1.0

0002-Update-decoding_failover-tests-for-failover-slots.patchtext/x-patch; charset=US-ASCII; name=0002-Update-decoding_failover-tests-for-failover-slots.patchDownload
From eef34447d9b69c32cd6da5116f24cd628370d4a9 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 8 Mar 2016 14:34:36 +0800
Subject: [PATCH 2/7] Update decoding_failover tests for failover slots

---
 .../recovery/t/006_logical_decoding_timelines.pl   | 29 +++++++++-------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/src/test/recovery/t/006_logical_decoding_timelines.pl b/src/test/recovery/t/006_logical_decoding_timelines.pl
index 1372d90..ed6cac7 100644
--- a/src/test/recovery/t/006_logical_decoding_timelines.pl
+++ b/src/test/recovery/t/006_logical_decoding_timelines.pl
@@ -19,7 +19,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 20;
+use Test::More tests => 19;
 use RecursiveCopy;
 use File::Copy;
 
@@ -64,7 +64,7 @@ $node_master->safe_psql('postgres', 'CHECKPOINT;');
 
 # Verify that only the before base_backup slot is on the replica
 $stdout = $node_replica->safe_psql('postgres', 'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
-is($stdout, 'before_basebackup', 'Expected to find only slot before_basebackup on replica');
+is($stdout, '', 'Expected to find no slots on replica');
 
 # Boom, crash
 $node_master->stop('immediate');
@@ -86,22 +86,16 @@ like(
 	qr/replication slot "after_basebackup" does not exist/,
 	'after_basebackup slot missing');
 
-# Should be able to read from slot created before base backup
+# or from before_basebackup, since the replica dropped it as a non-failover slot
 ($ret, $stdout, $stderr) = $node_replica->psql(
 	'postgres',
 "SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');",
 	timeout => 30);
-is($ret, 0, 'replay from slot before_basebackup succeeds');
-is( $stdout, q(BEGIN
-table public.decoding: INSERT: blah[text]:'beforebb'
-COMMIT
-BEGIN
-table public.decoding: INSERT: blah[text]:'afterbb'
-COMMIT
-BEGIN
-table public.decoding: INSERT: blah[text]:'after failover'
-COMMIT), 'decoded expected data from slot before_basebackup');
-is($stderr, '', 'replay from slot before_basebackup produces no stderr');
+is($ret, 3, 'replaying from before_basebackup slot fails');
+like(
+	$stderr,
+	qr/replication slot "before_basebackup" does not exist/,
+	'before_basebackup slot missing');
 
 # We don't need the standby anymore
 $node_replica->teardown_node();
@@ -121,9 +115,10 @@ is($node_master->psql('postgres', 'SELECT pg_drop_replication_slot(slot_name) FR
   0, 'dropping slots succeeds via pg_drop_replication_slot');
 
 # Same as before, we'll make one slot before basebackup, one after. This time
-# the basebackup will be with pg_basebackup so it'll omit both slots, then
-# we'll use SQL functions provided by the decoding_failover test module to
-# sync them to the replica, do some work, sync them and fail over then test
+# the basebackup will be with pg_basebackup. It'll copy the before_basebackup slot,
+# but since it's a non-failover slot the server will drop it immediately.
+# We'll use SQL functions provided by the decoding_failover test module to
+# sync both slots to the replica, do some work, sync them and fail over then test
 # again. This time we should have both the before- and after-basebackup
 # slots working.
 
-- 
2.1.0

0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patchtext/x-patch; charset=US-ASCII; name=0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patchDownload
From 73b9e5827f6e590e5c558f36ce0962f3bdecd2ad Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 23 Feb 2016 16:00:09 +0800
Subject: [PATCH 3/7] Retain extra WAL for failover slots in base backups

Change the return value of pg_start_backup(), the BASE_BACKUP walsender
command, etc. to report the minimum WAL position required by any failover
slot when that is lower than the redo position, so that base backups
contain the WAL required for the slots to work.

Add a new backup label entry 'MIN FAILOVER SLOT LSN' that, if present,
indicates the minimum LSN needed by any failover slot that is present in
the base backup. Backup tools should check for this entry and ensure
they retain all xlogs including and after that point.
---
 src/backend/access/transam/xlog.c | 41 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a92f09d..9018af5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9797,6 +9797,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	bool		backup_started_in_recovery = false;
 	XLogRecPtr	checkpointloc;
 	XLogRecPtr	startpoint;
+	XLogRecPtr	slot_startpoint = InvalidXLogRecPtr;
 	TimeLineID	starttli;
 	pg_time_t	stamp_time;
 	char		strfbuf[128];
@@ -9943,6 +9944,17 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 			checkpointfpw = ControlFile->checkPointCopy.fullPageWrites;
 			LWLockRelease(ControlFileLock);
 
+			/*
+			 * If failover slots are in use we must retain and transfer WAL
+			 * older than the redo location so that those slots can be replayed
+			 * from after a failover event.
+			 *
+			 * This MUST be at an xlog segment boundary so truncate the LSN
+			 * appropriately.
+			 */
+			if (max_replication_slots > 0)
+				slot_startpoint = (ReplicationSlotsComputeRequiredLSN(true) / XLOG_SEG_SIZE) * XLOG_SEG_SIZE;
+
 			if (backup_started_in_recovery)
 			{
 				XLogRecPtr	recptr;
@@ -10111,6 +10123,10 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 						 backup_started_in_recovery ? "standby" : "master");
 		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
 		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
+		if (slot_startpoint != InvalidXLogRecPtr)
+			appendStringInfo(&labelfbuf, "MIN FAILOVER SLOT LSN: %X/%X\n",
+						(uint32) (slot_startpoint >> 32), (uint32) slot_startpoint);
+
 
 		/*
 		 * Okay, write the file, or return its contents to caller.
@@ -10204,9 +10220,34 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 
 	/*
 	 * We're done.  As a convenience, return the starting WAL location.
+	 *
+	 * pg_basebackup etc expect to use this as the position to start copying
+	 * WAL from, so we should return the minimum of the slot start LSN and the
+	 * current redo position to make sure we get all WAL required by failover
+	 * slots.
+	 *
+	 * The min required LSN for failover slots is also available from the
+	 * 'MIN FAILOVER SLOT LSN' entry in the backup label file.
 	 */
+	if (slot_startpoint != InvalidXLogRecPtr && slot_startpoint < startpoint)
+	{
+		List *history;
+		TimeLineID slot_start_tli;
+
+		/* Min LSN required by a slot may be on an older timeline. */
+		history = readTimeLineHistory(ThisTimeLineID);
+		slot_start_tli = tliOfPointInHistory(slot_startpoint, history);
+		list_free_deep(history);
+
+		if (slot_start_tli < starttli)
+			starttli = slot_start_tli;
+
+		startpoint = slot_startpoint;
+	}
+
 	if (starttli_p)
 		*starttli_p = starttli;
+
 	return startpoint;
 }
 
-- 
2.1.0

0004-Add-the-UI-and-for-failover-slots.patchtext/x-patch; charset=US-ASCII; name=0004-Add-the-UI-and-for-failover-slots.patchDownload
From 52f07aa03ebde429cf3dccbe21bc6fa8e59eacc2 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 23 Feb 2016 16:04:05 +0800
Subject: [PATCH 4/7] Add the UI for failover slots

Expose failover slots to the user.

Add a new 'failover' argument to pg_create_logical_replication_slot and
pg_create_physical_replication_slot. Accept a new FAILOVER keyword
in the CREATE_REPLICATION_SLOT command of the walsender protocol.
---
 contrib/test_decoding/expected/ddl.out |  3 +++
 contrib/test_decoding/sql/ddl.sql      |  2 ++
 src/backend/catalog/system_views.sql   | 11 ++++++++++-
 src/backend/replication/repl_gram.y    | 13 +++++++++++--
 src/backend/replication/repl_scanner.l |  1 +
 src/backend/replication/slotfuncs.c    |  7 +++++--
 src/backend/replication/walsender.c    |  4 ++--
 src/include/catalog/pg_proc.h          |  4 ++--
 src/include/nodes/replnodes.h          |  1 +
 src/include/replication/slot.h         |  1 +
 10 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out
index 57a1289..5fed500 100644
--- a/contrib/test_decoding/expected/ddl.out
+++ b/contrib/test_decoding/expected/ddl.out
@@ -9,6 +9,9 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
 -- fail because of an already existing slot
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 ERROR:  replication slot "regression_slot" already exists
+-- fail because a failover slot can't replace a normal slot on the master
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding', true);
+ERROR:  replication slot "regression_slot" already exists
 -- fail because of an invalid name
 SELECT 'init' FROM pg_create_logical_replication_slot('Invalid Name', 'test_decoding');
 ERROR:  replication slot name "Invalid Name" contains invalid character
diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql
index e311c59..dc61ef4 100644
--- a/contrib/test_decoding/sql/ddl.sql
+++ b/contrib/test_decoding/sql/ddl.sql
@@ -4,6 +4,8 @@ SET synchronous_commit = on;
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 -- fail because of an already existing slot
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+-- fail because a failover slot can't replace a normal slot on the master
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding', true);
 -- fail because of an invalid name
 SELECT 'init' FROM pg_create_logical_replication_slot('Invalid Name', 'test_decoding');
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index abf9a70..fcb877d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -949,12 +949,21 @@ AS 'pg_logical_slot_peek_binary_changes';
 
 CREATE OR REPLACE FUNCTION pg_create_physical_replication_slot(
     IN slot_name name, IN immediately_reserve boolean DEFAULT false,
-    OUT slot_name name, OUT xlog_position pg_lsn)
+    IN failover boolean DEFAULT false, OUT slot_name name,
+    OUT xlog_position pg_lsn)
 RETURNS RECORD
 LANGUAGE INTERNAL
 STRICT VOLATILE
 AS 'pg_create_physical_replication_slot';
 
+CREATE OR REPLACE FUNCTION pg_create_logical_replication_slot(
+    IN slot_name name, IN plugin name, IN failover boolean DEFAULT false,
+    OUT slot_name text, OUT xlog_position pg_lsn)
+RETURNS RECORD
+LANGUAGE INTERNAL
+STRICT VOLATILE
+AS 'pg_create_logical_replication_slot';
+
 CREATE OR REPLACE FUNCTION
   make_interval(years int4 DEFAULT 0, months int4 DEFAULT 0, weeks int4 DEFAULT 0,
                 days int4 DEFAULT 0, hours int4 DEFAULT 0, mins int4 DEFAULT 0,
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index d93db88..1574f24 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -77,6 +77,7 @@ Node *replication_parse_result;
 %token K_LOGICAL
 %token K_SLOT
 %token K_RESERVE_WAL
+%token K_FAILOVER
 
 %type <node>	command
 %type <node>	base_backup start_replication start_logical_replication
@@ -90,6 +91,7 @@ Node *replication_parse_result;
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot
 %type <boolval>	opt_reserve_wal
+%type <boolval> opt_failover
 
 %%
 
@@ -184,23 +186,25 @@ base_backup_opt:
 
 create_replication_slot:
 			/* CREATE_REPLICATION_SLOT slot PHYSICAL RESERVE_WAL */
-			K_CREATE_REPLICATION_SLOT IDENT K_PHYSICAL opt_reserve_wal
+			K_CREATE_REPLICATION_SLOT IDENT K_PHYSICAL opt_reserve_wal opt_failover
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_PHYSICAL;
 					cmd->slotname = $2;
 					cmd->reserve_wal = $4;
+					cmd->failover = $5;
 					$$ = (Node *) cmd;
 				}
 			/* CREATE_REPLICATION_SLOT slot LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT K_LOGICAL IDENT
+			| K_CREATE_REPLICATION_SLOT IDENT K_LOGICAL IDENT opt_failover
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->plugin = $4;
+					cmd->failover = $5;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -276,6 +280,11 @@ opt_reserve_wal:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_failover:
+			K_FAILOVER						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index f83ec53..a1d9f10 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -98,6 +98,7 @@ PHYSICAL			{ return K_PHYSICAL; }
 RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
+FAILOVER			{ return K_FAILOVER; }
 
 ","				{ return ','; }
 ";"				{ return ';'; }
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index f430714..a2dfc40 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -18,6 +18,7 @@
 
 #include "access/htup_details.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "replication/logical.h"
 #include "replication/logicalfuncs.h"
 #include "utils/builtins.h"
@@ -41,6 +42,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 {
 	Name		name = PG_GETARG_NAME(0);
 	bool 		immediately_reserve = PG_GETARG_BOOL(1);
+	bool		failover = PG_GETARG_BOOL(2);
 	Datum		values[2];
 	bool		nulls[2];
 	TupleDesc	tupdesc;
@@ -57,7 +59,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 	CheckSlotRequirements();
 
 	/* acquire replication slot, this will check for conflicting names */
-	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT, false);
+	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT, failover);
 
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	nulls[0] = false;
@@ -96,6 +98,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 {
 	Name		name = PG_GETARG_NAME(0);
 	Name		plugin = PG_GETARG_NAME(1);
+	bool		failover = PG_GETARG_BOOL(2);
 
 	LogicalDecodingContext *ctx = NULL;
 
@@ -120,7 +123,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	 * errors during initialization because it'll get dropped if this
 	 * transaction fails. We'll make it persistent at the end.
 	 */
-	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL, false);
+	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL, failover);
 
 	/*
 	 * Create logical decoding context, to build the initial snapshot.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1583862..efdbfd1 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -792,7 +792,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	if (cmd->kind == REPLICATION_KIND_PHYSICAL)
 	{
-		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT, false);
+		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT, cmd->failover);
 	}
 	else
 	{
@@ -803,7 +803,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * handle errors during initialization because it'll get dropped if
 		 * this transaction fails. We'll make it persistent at the end.
 		 */
-		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL, false);
+		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL, cmd->failover);
 	}
 
 	initStringInfo(&output_message);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index aec6c4c..e7247af 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5077,13 +5077,13 @@ DATA(insert OID = 3473 (  spg_range_quad_leaf_consistent	PGNSP PGUID 12 1 0 0 0
 DESCR("SP-GiST support for quad tree over range");
 
 /* replication slots */
-DATA(insert OID = 3779 (  pg_create_physical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 2 0 2249 "19 16" "{19,16,19,3220}" "{i,i,o,o}" "{slot_name,immediately_reserve,slot_name,xlog_position}" _null_ _null_ pg_create_physical_replication_slot _null_ _null_ _null_ ));
+DATA(insert OID = 3779 (  pg_create_physical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 16 16" "{19,16,16,19,3220}" "{i,i,i,o,o}" "{slot_name,immediately_reserve,failover,slot_name,xlog_position}" _null_ _null_ pg_create_physical_replication_slot _null_ _null_ _null_ ));
 DESCR("create a physical replication slot");
 DATA(insert OID = 3780 (  pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 0 2278 "19" _null_ _null_ _null_ _null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ ));
 DESCR("drop a replication slot");
 DATA(insert OID = 3781 (  pg_get_replication_slots	PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,23,28,28,3220,3220}" "{o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null_ _null_ pg_get_replication_slots _null_ _null_ _null_ ));
 DESCR("information about replication slots currently in use");
-DATA(insert OID = 3786 (  pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 2 0 2249 "19 19" "{19,19,25,3220}" "{i,i,o,o}" "{slot_name,plugin,slot_name,xlog_position}" _null_ _null_ pg_create_logical_replication_slot _null_ _null_ _null_ ));
+DATA(insert OID = 3786 (  pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 19 16" "{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,failover,slot_name,xlog_position}" _null_ _null_ pg_create_logical_replication_slot _null_ _null_ _null_ ));
 DESCR("set up a logical replication slot");
 DATA(insert OID = 3782 (  pg_logical_slot_get_changes PGNSP PGUID 12 1000 1000 25 0 f f f f f t v u 4 0 2249 "19 3220 23 1009" "{19,3220,23,1009,3220,28,25}" "{i,i,i,v,o,o,o}" "{slot_name,upto_lsn,upto_nchanges,options,location,xid,data}" _null_ _null_ pg_logical_slot_get_changes _null_ _null_ _null_ ));
 DESCR("get changes from replication slot");
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index d2f1edb..a8fa9d5 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		reserve_wal;
+	bool		failover;
 } CreateReplicationSlotCmd;
 
 
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index cdcbd37..9e23a29 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -4,6 +4,7 @@
  *
  * Copyright (c) 2012-2016, PostgreSQL Global Development Group
  *
+ * src/include/replication/slot.h
  *-------------------------------------------------------------------------
  */
 #ifndef SLOT_H
-- 
2.1.0
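Patch 0004's grammar change makes FAILOVER an optional trailing keyword in CREATE_REPLICATION_SLOT for both the PHYSICAL and LOGICAL forms. A small illustrative helper (not part of the patch) that builds the walsender command text per the extended grammar in repl_gram.y:

```python
def create_replication_slot_cmd(slot_name: str, kind: str, plugin: str = None,
                                reserve_wal: bool = False,
                                failover: bool = False) -> str:
    # FAILOVER trails RESERVE_WAL in the physical form and the output
    # plugin name in the logical form, matching the patched grammar.
    parts = ["CREATE_REPLICATION_SLOT", slot_name]
    if kind == "physical":
        parts.append("PHYSICAL")
        if reserve_wal:
            parts.append("RESERVE_WAL")
    elif kind == "logical":
        if plugin is None:
            raise ValueError("logical slots need an output plugin")
        parts.extend(["LOGICAL", plugin])
    else:
        raise ValueError(f"unknown slot kind: {kind}")
    if failover:
        parts.append("FAILOVER")
    return " ".join(parts)
```

At the SQL level the same option is exposed as the new boolean third argument to pg_create_physical_replication_slot and pg_create_logical_replication_slot.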

Attachment: 0005-Document-failover-slots.patch (text/x-patch)
From 47f8bd5ecfe896824c9e51f100c47795a55ce601 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 23 Feb 2016 15:31:13 +0800
Subject: [PATCH 5/7] Document failover slots

---
 doc/src/sgml/func.sgml              | 15 +++++++++-----
 doc/src/sgml/high-availability.sgml | 41 +++++++++++++++++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml   |  2 +-
 doc/src/sgml/protocol.sgml          | 19 ++++++++++++++++-
 4 files changed, 70 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index c0b94bc..649a0c2 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17449,7 +17449,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         <indexterm>
          <primary>pg_create_physical_replication_slot</primary>
         </indexterm>
-        <literal><function>pg_create_physical_replication_slot(<parameter>slot_name</parameter> <type>name</type> <optional>, <parameter>immediately_reserve</> <type>boolean</> </optional>)</function></literal>
+        <literal><function>pg_create_physical_replication_slot(<parameter>slot_name</parameter> <type>name</type>, <optional><parameter>immediately_reserve</> <type>boolean</></optional>, <optional><parameter>failover</> <type>boolean</></optional>)</function></literal>
        </entry>
        <entry>
         (<parameter>slot_name</parameter> <type>name</type>, <parameter>xlog_position</parameter> <type>pg_lsn</type>)
@@ -17460,7 +17460,10 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         when <literal>true</>, specifies that the <acronym>LSN</> for this
         replication slot be reserved immediately; otherwise
         the <acronym>LSN</> is reserved on first connection from a streaming
-        replication client. Streaming changes from a physical slot is only
+        replication client. If <literal>failover</literal> is <literal>true</literal>
+        then the slot is created as a failover slot; see <xref
+        linkend="streaming-replication-slots-failover">.
+        Streaming changes from a physical slot is only
         possible with the streaming-replication protocol &mdash;
         see <xref linkend="protocol-replication">. This function corresponds
         to the replication protocol command <literal>CREATE_REPLICATION_SLOT
@@ -17489,7 +17492,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         <indexterm>
          <primary>pg_create_logical_replication_slot</primary>
         </indexterm>
-        <literal><function>pg_create_logical_replication_slot(<parameter>slot_name</parameter> <type>name</type>, <parameter>plugin</parameter> <type>name</type>)</function></literal>
+        <literal><function>pg_create_logical_replication_slot(<parameter>slot_name</parameter> <type>name</type>, <parameter>plugin</parameter> <type>name</type>, <optional><parameter>failover</> <type>boolean</></optional>)</function></literal>
        </entry>
        <entry>
         (<parameter>slot_name</parameter> <type>name</type>, <parameter>xlog_position</parameter> <type>pg_lsn</type>)
@@ -17497,8 +17500,10 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
        <entry>
         Creates a new logical (decoding) replication slot named
         <parameter>slot_name</parameter> using the output plugin
-        <parameter>plugin</parameter>.  A call to this function has the same
-        effect as the replication protocol command
+        <parameter>plugin</parameter>. If <literal>failover</literal>
+        is <literal>true</literal>, the slot is created as a failover
+        slot; see <xref linkend="streaming-replication-slots-failover">. A call to
+        this function has the same effect as the replication protocol command
         <literal>CREATE_REPLICATION_SLOT ... LOGICAL</literal>.
        </entry>
       </row>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 6cb690c..4b75175 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -949,6 +949,47 @@ primary_slot_name = 'node_a_slot'
 </programlisting>
     </para>
    </sect3>
+
+   <sect3 id="streaming-replication-slots-failover" xreflabel="Failover slots">
+     <title>Failover slots</title>
+
+     <para>
+      Normally a replication slot is not preserved across backup and restore
+      (such as by <application>pg_basebackup</application>) and is not
+      replicated to standbys. Slots are <emphasis>automatically
+      dropped</emphasis> when starting up as a streaming replica or in archive
+      recovery (PITR) mode.
+     </para>
+
+     <para>
+      To make it possible for an application to consistently follow a
+      failover when a replica is promoted to a new master, a slot may be
+      created as a <emphasis>failover slot</emphasis>. A failover slot may
+      only be created, replayed from, or dropped on a master server. Changes
+      to the slot are written to WAL and replicated to standbys. When a
+      standby is promoted, applications may connect to the slot on the
+      standby and resume replay from it at a consistent point, as if it were
+      the original master. Failover slots may not be used to replay from a
+      standby before promotion.
+     </para>
+
+     <para>
+      Non-failover slots may be created on and used from a replica. This is
+      currently limited to physical slots, as logical decoding is not
+      on replica server.
+     </para>
+
+     <para>
+      When a failover slot created on the master has the same name as a
+      non-failover slot on a replica server, the non-failover slot will be
+      automatically dropped. Any client currently connected will be
+      disconnected with an error indicating a conflict with recovery. It
+      is strongly recommended that you avoid creating failover slots with
+      the same name as slots on replicas.
+     </para>
+
+   </sect3>
+
   </sect2>
 
   <sect2 id="cascading-replication">
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index e841348..c7b43ed 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -280,7 +280,7 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
     The commands
     <itemizedlist>
      <listitem>
-      <para><literal>CREATE_REPLICATION_SLOT <replaceable>slot_name</replaceable> LOGICAL <replaceable>output_plugin</replaceable></literal></para>
+      <para><literal>CREATE_REPLICATION_SLOT <replaceable>slot_name</replaceable> LOGICAL <replaceable>output_plugin</replaceable> <optional>FAILOVER</optional></literal></para>
      </listitem>
 
      <listitem>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 522128e..33b6830 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1434,7 +1434,7 @@ The commands accepted in walsender mode are:
   </varlistentry>
 
   <varlistentry>
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> { <literal>PHYSICAL</> [ <literal>RESERVE_WAL</> ] | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> { <literal>PHYSICAL</> <optional><literal>RESERVE_WAL</></> | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> } <optional><literal>FAILOVER</></>
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1474,6 +1474,17 @@ The commands accepted in walsender mode are:
         </para>
        </listitem>
       </varlistentry>
+
+      <varlistentry>
+       <term><literal>FAILOVER</></term>
+       <listitem>
+        <para>
+         Create this slot as a <link linkend="streaming-replication-slots-failover">
+         failover slot</link>.
+        </para>
+       </listitem>
+      </varlistentry>
+
      </variablelist>
     </listitem>
   </varlistentry>
@@ -1829,6 +1840,12 @@ The commands accepted in walsender mode are:
       to process the output for streaming.
      </para>
 
+     <para>
+      Logical replication automatically follows timeline switches. It is
+      not necessary or possible to supply a <literal>TIMELINE</literal>
+      option as in physical replication.
+     </para>
+
      <variablelist>
       <varlistentry>
        <term><literal>SLOT</literal> <replaceable class="parameter">slot_name</></term>
-- 
2.1.0
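The high-availability documentation above describes the conflict rule: when WAL for a failover slot reaches a standby that already has a non-failover slot of the same name, the local slot is dropped and any connected client is disconnected with a recovery-conflict error. A conceptual model of that rule (purely illustrative, not the patch's implementation):

```python
def apply_failover_slot_redo(standby_slots: dict, wal_slot_name: str) -> bool:
    """Model the documented behaviour on a standby when redo for a
    failover slot arrives. Returns True if a conflicting local
    non-failover slot had to be dropped."""
    existing = standby_slots.get(wal_slot_name)
    dropped = existing is not None and not existing["failover"]
    # The replicated failover slot takes the name; a same-named local
    # non-failover slot does not survive.
    standby_slots[wal_slot_name] = {"failover": True}
    return dropped
```

This is why the docs recommend avoiding failover slot names that collide with slots already defined on replicas.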

Attachment: 0006-Add-failover-to-pg_replication_slots.patch (text/x-patch)
From 0a64990cf0b89ab29f64c46c2636e32dc37258fd Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 23 Feb 2016 15:55:01 +0800
Subject: [PATCH 6/7] Add 'failover' to pg_replication_slots

---
 contrib/test_decoding/expected/ddl.out | 38 ++++++++++++++++++++++++++++------
 contrib/test_decoding/sql/ddl.sql      | 15 ++++++++++++--
 doc/src/sgml/catalogs.sgml             | 10 +++++++++
 src/backend/catalog/system_views.sql   |  1 +
 src/backend/replication/slotfuncs.c    |  6 +++++-
 src/include/catalog/pg_proc.h          |  2 +-
 src/test/regress/expected/rules.out    |  3 ++-
 7 files changed, 64 insertions(+), 11 deletions(-)

diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out
index 5fed500..5b2f34a 100644
--- a/contrib/test_decoding/expected/ddl.out
+++ b/contrib/test_decoding/expected/ddl.out
@@ -61,11 +61,37 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
 SELECT slot_name, plugin, slot_type, active,
     NOT catalog_xmin IS NULL AS catalog_xmin_set,
     xmin IS NULl  AS data_xmin_not_set,
-    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
 FROM pg_replication_slots;
-    slot_name    |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal 
------------------+---------------+-----------+--------+------------------+-------------------+----------
- regression_slot | test_decoding | logical   | f      | t                | t                 | t
+    slot_name    |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal | failover 
+-----------------+---------------+-----------+--------+------------------+-------------------+----------+----------
+ regression_slot | test_decoding | logical   | f      | t                | t                 | t        | f
+(1 row)
+
+/* same for a failover slot */
+SELECT 'init' FROM pg_create_logical_replication_slot('failover_slot', 'test_decoding', true);
+ ?column? 
+----------
+ init
+(1 row)
+
+SELECT slot_name, plugin, slot_type, active,
+    NOT catalog_xmin IS NULL AS catalog_xmin_set,
+    xmin IS NULl  AS data_xmin_not_set,
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
+FROM pg_replication_slots
+WHERE slot_name = 'failover_slot';
+   slot_name   |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal | failover 
+---------------+---------------+-----------+--------+------------------+-------------------+----------+----------
+ failover_slot | test_decoding | logical   | f      | t                | t                 | t        | t
+(1 row)
+
+SELECT pg_drop_replication_slot('failover_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
 (1 row)
 
 /*
@@ -676,7 +702,7 @@ SELECT pg_drop_replication_slot('regression_slot');
 
 /* check that the slot is gone */
 SELECT * FROM pg_replication_slots;
- slot_name | plugin | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
+ slot_name | plugin | slot_type | datoid | database | active | active_pid | failover | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
+-----------+--------+-----------+--------+----------+--------+------------+----------+------+--------------+-------------+---------------------
 (0 rows)
 
diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql
index dc61ef4..f64b21c 100644
--- a/contrib/test_decoding/sql/ddl.sql
+++ b/contrib/test_decoding/sql/ddl.sql
@@ -24,16 +24,27 @@ SELECT 'init' FROM pg_create_physical_replication_slot('repl');
 SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 SELECT pg_drop_replication_slot('repl');
 
-
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 
 /* check whether status function reports us, only reproduceable columns */
 SELECT slot_name, plugin, slot_type, active,
     NOT catalog_xmin IS NULL AS catalog_xmin_set,
     xmin IS NULl  AS data_xmin_not_set,
-    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
 FROM pg_replication_slots;
 
+/* same for a failover slot */
+SELECT 'init' FROM pg_create_logical_replication_slot('failover_slot', 'test_decoding', true);
+SELECT slot_name, plugin, slot_type, active,
+    NOT catalog_xmin IS NULL AS catalog_xmin_set,
+    xmin IS NULl  AS data_xmin_not_set,
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
+FROM pg_replication_slots
+WHERE slot_name = 'failover_slot';
+SELECT pg_drop_replication_slot('failover_slot');
+
 /*
  * Check that changes are handled correctly when interleaved with ddl
  */
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 951f59b..0a3af1f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -5377,6 +5377,16 @@
      </row>
 
      <row>
+      <entry><structfield>failover</structfield></entry>
+      <entry><type>boolean</type></entry>
+      <entry></entry>
+      <entry>
+       True if this slot is a failover slot; see
+       <xref linkend="streaming-replication-slots-failover">.
+      </entry>
+     </row>
+
+     <row>
       <entry><structfield>xmin</structfield></entry>
       <entry><type>xid</type></entry>
       <entry></entry>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fcb877d..26c02e4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -704,6 +704,7 @@ CREATE VIEW pg_replication_slots AS
             D.datname AS database,
             L.active,
             L.active_pid,
+            L.failover,
             L.xmin,
             L.catalog_xmin,
             L.restart_lsn,
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index a2dfc40..abc450d 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -177,7 +177,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS)
 Datum
 pg_get_replication_slots(PG_FUNCTION_ARGS)
 {
-#define PG_GET_REPLICATION_SLOTS_COLS 10
+#define PG_GET_REPLICATION_SLOTS_COLS 11
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -227,6 +227,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		XLogRecPtr	restart_lsn;
 		XLogRecPtr	confirmed_flush_lsn;
 		pid_t		active_pid;
+		bool		failover;
 		Oid			database;
 		NameData	slot_name;
 		NameData	plugin;
@@ -249,6 +250,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 			namecpy(&plugin, &slot->data.plugin);
 
 			active_pid = slot->active_pid;
+			failover = slot->data.failover;
 		}
 		SpinLockRelease(&slot->mutex);
 
@@ -279,6 +281,8 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		else
 			nulls[i++] = true;
 
+		values[i++] = BoolGetDatum(failover);
+
 		if (xmin != InvalidTransactionId)
 			values[i++] = TransactionIdGetDatum(xmin);
 		else
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index e7247af..836db85 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5081,7 +5081,7 @@ DATA(insert OID = 3779 (  pg_create_physical_replication_slot PGNSP PGUID 12 1 0
 DESCR("create a physical replication slot");
 DATA(insert OID = 3780 (  pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 0 2278 "19" _null_ _null_ _null_ _null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ ));
 DESCR("drop a replication slot");
-DATA(insert OID = 3781 (  pg_get_replication_slots	PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,23,28,28,3220,3220}" "{o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null_ _null_ pg_get_replication_slots _null_ _null_ _null_ ));
+DATA(insert OID = 3781 (  pg_get_replication_slots	PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,23,16,28,28,3220,3220}" "{o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,active,active_pid,failover,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null_ _null_ pg_get_replication_slots _null_ _null_ _null_ ));
 DESCR("information about replication slots currently in use");
 DATA(insert OID = 3786 (  pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 19 16" "{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,failover,slot_name,xlog_position}" _null_ _null_ pg_create_logical_replication_slot _null_ _null_ _null_ ));
 DESCR("set up a logical replication slot");
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 81bc5c9..d8315c6 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1417,11 +1417,12 @@ pg_replication_slots| SELECT l.slot_name,
     d.datname AS database,
     l.active,
     l.active_pid,
+    l.failover,
     l.xmin,
     l.catalog_xmin,
     l.restart_lsn,
     l.confirmed_flush_lsn
-   FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, active, active_pid, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn)
+   FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, active, active_pid, failover, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn)
      LEFT JOIN pg_database d ON ((l.datoid = d.oid)));
 pg_roles| SELECT pg_authid.rolname,
     pg_authid.rolsuper,
-- 
2.1.0
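Patch 0006 inserts the new `failover` column after `active_pid`, taking pg_get_replication_slots() to eleven output columns (PG_GET_REPLICATION_SLOTS_COLS). A hypothetical helper showing the resulting column layout, useful for pairing raw rows with names in a test harness:

```python
# Column order of pg_get_replication_slots() after the patch; 'failover'
# sits between 'active_pid' and 'xmin'.
SLOT_COLUMNS = (
    "slot_name", "plugin", "slot_type", "datoid", "active", "active_pid",
    "failover", "xmin", "catalog_xmin", "restart_lsn", "confirmed_flush_lsn",
)

def row_to_dict(row: tuple) -> dict:
    # Pair a raw result tuple (e.g. parsed from psql output) with the
    # view's column names.
    if len(row) != len(SLOT_COLUMNS):
        raise ValueError("unexpected column count")
    return dict(zip(SLOT_COLUMNS, row))
```

The regression-test churn in ddl.out and rules.out above is exactly this column insertion rippling through expected output.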

Attachment: 0007-Introduce-TAP-recovery-tests-for-failover-slots.patch (text/x-patch)
From 2b5af6a1e1a73b614057b8a6b9e1e1d822b7baa8 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 10 Mar 2016 10:50:59 +0800
Subject: [PATCH 7/7] Introduce TAP recovery tests for failover slots

---
 src/test/recovery/t/007_failover_slots.pl | 367 ++++++++++++++++++++++++++++++
 1 file changed, 367 insertions(+)
 create mode 100644 src/test/recovery/t/007_failover_slots.pl

diff --git a/src/test/recovery/t/007_failover_slots.pl b/src/test/recovery/t/007_failover_slots.pl
new file mode 100644
index 0000000..8524e20
--- /dev/null
+++ b/src/test/recovery/t/007_failover_slots.pl
@@ -0,0 +1,367 @@
+#
+# Test failover slots
+#
+use strict;
+use warnings;
+use bigint;
+use PostgresNode;
+use TestLib;
+use Test::More;
+use RecursiveCopy;
+use File::Copy;
+use File::Basename qw(basename);
+use List::Util qw();
+use Data::Dumper;
+
+use Carp 'verbose';
+$SIG{ __DIE__ } = sub { Carp::confess( @_ ) };
+
+sub lsn_to_bigint
+{
+	my ($lsn) = @_;
+	my ($high, $low) = split("/",$lsn);
+	return hex($high) * 2**32 + hex($low);
+}
+
+sub get_slot_info
+{
+	my ($node, $slot_name) = @_;
+
+	my $esc_slot_name = $slot_name;
+	$esc_slot_name =~ s/'/''/g;
+	my @selectlist = ('slot_name', 'plugin', 'slot_type', 'database', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn', 'confirmed_flush_lsn');
+	my $row = $node->safe_psql('postgres', "SELECT " . join(', ', @selectlist) . " FROM pg_replication_slots WHERE slot_name = '$esc_slot_name';",
+		extra_params => ['-z']);
+	chomp $row;
+	my @fields = split("\0", $row);
+	if (scalar @fields != scalar @selectlist)
+	{
+		die "Select-list '@selectlist' didn't match length of result-list '@fields'";
+	}
+	my %slotinfo;
+	for (my $i = 0; $i < scalar @selectlist; $i++)
+	{
+		$slotinfo{$selectlist[$i]} = $fields[$i];
+	}
+	return \%slotinfo;
+}
+
+sub diag_slotinfo
+{
+	my ($info, $msg) = @_;
+	diag "slot " . $info->{slot_name} . ": " . Dumper($info);
+}
+
+sub wait_for_catchup
+{
+	my ($node_master, $node_replica) = @_;
+
+	my $master_lsn = $node_master->safe_psql('postgres', 'SELECT pg_current_xlog_insert_location()');
+	diag "waiting for " . $node_replica->name . " to catch up to $master_lsn on " . $node_master->name;
+	my $ret = $node_replica->poll_query_until('postgres',
+		"SELECT pg_last_xlog_replay_location() >= '$master_lsn'::pg_lsn;");
+	BAIL_OUT('replica failed to catch up') unless $ret;
+	my $replica_lsn = $node_replica->safe_psql('postgres', 'SELECT pg_last_xlog_replay_location()');
+	diag "Replica is caught up to $replica_lsn, past required LSN $master_lsn";
+}
+
+sub read_slot_updates_from_xlog
+{
+	my ($node, $timeline) = @_;
+	my ($stdout, $stderr) = ('', '');
+	# Look at master xlogs and examine sequence advances
+	my $wal_pattern = sprintf("%s/pg_xlog/%08X" . ("?" x 16), $node->data_dir, $timeline);
+	my @wal = glob $wal_pattern;
+	my $firstwal = List::Util::minstr(@wal);
+	my $lastwal = basename(List::Util::maxstr(@wal));
+	diag "decoding xlog on " . $node->name . " from $firstwal to $lastwal";
+	IPC::Run::run ['pg_xlogdump', $firstwal, $lastwal], '>', \$stdout, '2>', \$stderr;
+	like($stderr, qr/invalid record length at [0-9A-F]+\/[0-9A-F]+: wanted 24, got 0/,
+		'pg_xlogdump exits with expected error');
+	my @slots = grep(/ReplicationSlot/, split(/\n/, $stdout));
+
+	# Parse the dumped xlog data
+	my @slot_updates = ();
+	for my $slot (@slots) {
+		if (my @matches = ($slot =~ /lsn: ([[:xdigit:]]{1,8}\/[[:xdigit:]]{1,8}), prev [[:xdigit:]]{1,8}\/[[:xdigit:]]{1,8}, desc: UPDATE of slot (\w+) with restart ([[:xdigit:]]{1,8}\/[[:xdigit:]]{1,8}) and xid ([[:digit:]]+) confirmed to ([[:xdigit:]]{1,8}\/[[:xdigit:]]{1,8})/))
+		{
+			my %slot_update = (
+				action => 'UPDATE',
+				log_lsn => $1, slot_name => $2, restart_lsn => $3,
+				xid => $4, confirm_lsn => $5
+				);
+			diag "Replication slot create/advance: $slot_update{slot_name} advanced to $slot_update{confirm_lsn} with restart $slot_update{restart_lsn} and $slot_update{xid} in xlog entry $slot_update{log_lsn}";
+			push @slot_updates, \%slot_update;
+		}
+		elsif ($slot =~ /DELETE/)
+		{
+			diag "Replication slot delete: $slot";
+		}
+		else
+		{
+			die "Slot xlog entry didn't match pattern: $slot";
+		}
+	}
+	return \@slot_updates;
+}
+
+sub check_slot_wal_update
+{
+	my ($entry, $slotname, %params) = @_;
+
+	ok(defined($entry), "xlog entry exists for slot $slotname");
+	SKIP: {
+		skip 'Expected xlog entry was undef' unless defined($entry);
+		my %entry = %{$entry}; undef $entry;
+		diag "Examining decoded slot update xlog entry: " . Dumper(\%entry);
+		is($entry{action}, 'UPDATE', "action is an update");
+		is($entry{slot_name}, $slotname, "action affects slot " . $slotname);
+
+		cmp_ok(lsn_to_bigint($entry{restart_lsn}), "le",
+		       lsn_to_bigint($entry{log_lsn}),
+		       "restart_lsn is no greater than LSN when logged");
+
+		cmp_ok(lsn_to_bigint($entry{confirm_lsn}), "le",
+		       lsn_to_bigint($entry{log_lsn}),
+		       "confirm_lsn is no greater than LSN when logged");
+
+		cmp_ok(lsn_to_bigint($entry{confirm_lsn}), "ge",
+			lsn_to_bigint($entry{restart_lsn}),
+			'confirm_lsn equal to or ahead of restart_lsn');
+
+		cmp_ok(lsn_to_bigint($entry{restart_lsn}), "le",
+			lsn_to_bigint($params{expect_max_restart_lsn}),
+			'restart_lsn is at or before expected')
+			if ($params{expect_max_restart_lsn});
+
+		cmp_ok(lsn_to_bigint($entry{restart_lsn}), "ge",
+			lsn_to_bigint($params{expect_min_restart_lsn}),
+			'restart_lsn is at or after expected')
+			if ($params{expect_min_restart_lsn});
+
+		cmp_ok(lsn_to_bigint($entry{confirm_lsn}), "le",
+			lsn_to_bigint($params{expect_max_confirm_lsn}),
+			'confirm_lsn is at or before expected')
+			if ($params{expect_max_confirm_lsn});
+
+		cmp_ok(lsn_to_bigint($entry{confirm_lsn}), "ge",
+			lsn_to_bigint($params{expect_min_confirm_lsn}),
+			'confirm_lsn is at or after expected')
+			if ($params{expect_min_confirm_lsn});
+	}
+}
+
+sub test_read_from_slot
+{
+	my ($node, $slot, $expected) = @_;
+	my $slot_quoted = $slot;
+	$slot_quoted =~ s/'/''/g;
+	my ($ret, $stdout, $stderr) = $node->psql('postgres',
+		"SELECT data FROM pg_logical_slot_peek_changes('$slot_quoted', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');"
+	);
+	is($ret, 0, "replaying from slot $slot is successful");
+	is($stderr, '', "replay from slot $slot produces no stderr");
+	if (defined($expected)) {
+		is($stdout, $expected, "slot $slot returned expected output");
+	}
+	return $stderr;
+}
+
+sub wait_for_end_of_recovery
+{
+	my ($node) = @_;
+	$node->poll_query_until('postgres',
+		"SELECT NOT pg_is_in_recovery();");
+}
+
+diag "";
+
+
+
+my ($stdout, $stderr, $ret, $slotinfo);
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 2\n");
+$node_master->append_conf('postgresql.conf', "max_wal_senders = 2\n");
+$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug3'\n");
+$node_master->dump_info;
+$node_master->start;
+
+my $master_beforecreate_bb_lsn = $node_master->safe_psql('postgres',
+	"SELECT pg_current_xlog_insert_location()");
+
+diag "master LSN is $master_beforecreate_bb_lsn before creation of bb_failover";
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('bb_failover', 'test_decoding', true);"
+);
+my $bb_beforeconsume_si = get_slot_info($node_master, 'bb_failover');
+diag_slotinfo $bb_beforeconsume_si, 'bb_beforeconsume';
+
+$node_master->safe_psql('postgres', "CREATE TABLE decoding(blah text);");
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('consumed');");
+($ret, $stdout, $stderr) = $node_master->psql('postgres',
+	"SELECT data FROM pg_logical_slot_get_changes('bb_failover', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+is($ret, 0, 'replaying from bb_failover on master is successful');
+is( $stdout, q(BEGIN
+table public.decoding: INSERT: blah[text]:'consumed'
+COMMIT), 'decoded expected data from slot bb_failover on master');
+is($stderr, '', 'replay from slot bb_failover produces no stderr');
+
+my $bb_afterconsume_si = get_slot_info($node_master, 'bb_failover');
+diag_slotinfo $bb_afterconsume_si, 'bb_afterconsume';
+
+($ret, $stdout, $stderr) = $node_master->psql('postgres',
+	"SELECT data FROM pg_logical_slot_get_changes('bb_failover', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+is ($ret, 0, 'no error reading empty slot changes after get');
+is ($stdout, '', 'no new changes to read from slot after get');
+
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('beforebb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my $backup_name = 'b1';
+$node_master->backup_fs_hot($backup_name);
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+$node_replica->start;
+
+my $master_beforecreate_ab_lsn = $node_master->safe_psql('postgres',
+	"SELECT pg_current_xlog_insert_location()");
+
+diag "master LSN is $master_beforecreate_ab_lsn before creation of ab_failover";
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('ab_failover', 'test_decoding', true);"
+);
+
+my $ab_beforeconsume_si = get_slot_info($node_master, 'ab_failover');
+diag_slotinfo $ab_beforeconsume_si, 'ab_beforeconsume';
+
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('afterbb');");
+
+wait_for_catchup($node_master, $node_replica);
+
+$stdout = $node_master->safe_psql('postgres', 'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
+is($stdout, "ab_failover\nbb_failover", 'Both failover slots exist on master');
+
+
+# Verify that both failover slots are present on the replica
+$stdout = $node_replica->safe_psql('postgres', 'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
+is($stdout, "ab_failover\nbb_failover", 'Both failover slots exist on replica')
+  or BAIL_OUT('Remaining tests meaningless');
+
+# Boom, crash
+$node_master->stop('fast');
+
+my @slot_updates = @{ read_slot_updates_from_xlog($node_master, 1) };
+
+#
+# Decode the WAL from the master and make sure the expected entries and only the
+# expected entries are present.
+#
+# We want to see two WAL entries, one for each slot. There won't be another entry
+# for the slot advance because right now we don't write out WAL when a slot's confirmed
+# location advances, only when the flush location or xmin advance. The restart lsn
+# and confirmed flush LSN in the slot's WAL record must not be less than the LSN
+# of the master before we created the slot and not greater than the position we saw
+# in pg_replication_slots after slot creation.
+#
+
+check_slot_wal_update($slot_updates[0], 'bb_failover',
+	expect_min_restart_lsn => $master_beforecreate_bb_lsn,
+	expect_min_confirm_lsn => $master_beforecreate_bb_lsn,
+	expect_max_restart_lsn => $bb_beforeconsume_si->{restart_lsn},
+	expect_max_confirm_lsn => $bb_beforeconsume_si->{confirmed_flush_lsn});
+
+check_slot_wal_update($slot_updates[1], 'ab_failover',
+	expect_min_restart_lsn => $master_beforecreate_ab_lsn,
+	expect_min_confirm_lsn => $master_beforecreate_ab_lsn,
+	expect_max_restart_lsn => $ab_beforeconsume_si->{restart_lsn},
+	expect_max_confirm_lsn => $ab_beforeconsume_si->{confirmed_flush_lsn});
+
+# Consuming from a slot does not cause the slot to be written out even on
+# CHECKPOINT. Since nothing else would have dirtied the slot, there should
+# be no more WAL entries for failover slots.
+#
+# The client is expected to keep track of the confirmed LSN and skip replaying
+# data it's already seen.
+ok(!defined($slot_updates[2]), 'Third xlog entry does not exist');
+
+$node_replica->promote;
+
+wait_for_end_of_recovery($node_replica);
+
+$node_replica->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('after failover');");
+
+my $bb_afterpromote_si = get_slot_info($node_replica, 'bb_failover');
+diag_slotinfo $bb_afterpromote_si, 'bb_afterpromote';
+# Because the confirmed LSN didn't get logged, the replica should have the slot
+# at the position it was created at, not the position after we consumed data.
+is($bb_afterpromote_si->{confirmed_flush_lsn}, $bb_beforeconsume_si->{confirmed_flush_lsn},
+	'slot bb_failover confirmed pos on replica has gone backwards');
+# the restart position won't have advanced either since we didn't log any new
+# entries for it and we haven't done enough work to trigger a flush.
+is($bb_afterpromote_si->{restart_lsn}, $bb_beforeconsume_si->{restart_lsn},
+	'slot bb_failover restart position is unchanged');
+
+# Same for the after-basebackup slot.
+my $ab_afterpromote_si = get_slot_info($node_replica, 'ab_failover');
+diag_slotinfo $ab_afterpromote_si, 'ab_afterpromote';
+is($ab_afterpromote_si->{confirmed_flush_lsn}, $ab_beforeconsume_si->{confirmed_flush_lsn},
+	'slot ab_failover confirmed pos on replica has gone backwards');
+is($ab_afterpromote_si->{restart_lsn}, $ab_beforeconsume_si->{restart_lsn},
+	'slot ab_failover restart position is unchanged');
+
+
+
+
+# Can replay from slot ab, following the timeline switch
+test_read_from_slot($node_replica, 'ab_failover', q(BEGIN
+table public.decoding: INSERT: blah[text]:'afterbb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'after failover'
+COMMIT));
+
+# Can replay from slot bb too
+#
+# Note that we expect to see data that we already replayed on the master here
+# because the confirm lsn won't be flushed on the master and will go backwards.
+#
+# See http://www.postgresql.org/message-id/CAMsr+YGSaTRGqPcx9qx4eOcizWsa27XjKEiPSOtAJE8OfiXT-g@mail.gmail.com
+#
+# (If Pg is patched to flush all slots on shutdown then this will change, but
+#  it'll still be able to go backwards on an unclean shutdown).
+#
+test_read_from_slot($node_replica, 'bb_failover', q(BEGIN
+table public.decoding: INSERT: blah[text]:'consumed'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'beforebb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'afterbb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'after failover'
+COMMIT));
+
+$node_replica->stop('fast');
+
+# We don't need the standby anymore
+$node_replica->teardown_node();
+
+done_testing();
-- 
2.1.0
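The test's `lsn_to_bigint` helper treats a `pg_lsn` value such as `16/B374D848` as a single 64-bit number (high word shifted left 32 bits, plus the low word) so that positions can be compared with ordinary integer operators. A minimal standalone sketch of the same conversion in Python (the function name here is ours, not part of the patch):

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string like '16/B374D848' to a 64-bit integer.

    pg_lsn prints as two hex words separated by a slash: the high
    32 bits, then the low 32 bits of the WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


# Ordering on the integer form matches pg_lsn ordering:
assert lsn_to_int("1/0") > lsn_to_int("0/FFFFFFFF")
```

This is the same arithmetic the Perl helper performs with `hex($high) * 2**32 + hex($low)`; the test relies on it for all the `cmp_ok` restart/confirm LSN comparisons.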

#24Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#23)
Re: WIP: Failover Slots

On 15 March 2016 at 21:40, Craig Ringer <craig@2ndquadrant.com> wrote:

Here's a new failover slots rev, addressing the issues Oleksii Kliukin
raised and adding a bunch of TAP tests.

Ahem, just found an issue here. I'll need to send another revision.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#25Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#24)
7 attachment(s)
Re: WIP: Failover Slots

OK, here's the latest failover slots patch, rebased on top of today's
master, with the following applied first, in order:

- Dirty replication slots when confirm_lsn is changed
(
/messages/by-id/CAMsr+YHJ0OyCUG2zbyQpRHxMcjnkt9D57mSxDZgWBKcvx3+r-w@mail.gmail.com
)

- logical decoding timeline following
(
/messages/by-id/CAMsr+YH-C1-X_+s=2nzAPnR0wwqJa-rUmVHSYyZaNSn93MUBMQ@mail.gmail.com
)

The full tree is at
https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots if you want
to avoid the fiddling around required to apply the patch series.

Attachments:

Attachment: 0001-Allow-replication-slots-to-follow-failover.patch (text/x-patch)
From 575982ce8abb9bbdcb220d35ba9a2b8808a6baf2 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 23 Feb 2016 15:59:37 +0800
Subject: [PATCH 1/7] Allow replication slots to follow failover

Originally replication slots were unique to a single node and weren't
recorded in WAL or replicated. A logical decoding client couldn't follow
a physical standby failover and promotion because the promoted replica
didn't have the original master's slots. The replica may not have
retained all required WAL and there was no way to create a new logical
slot and rewind it back to the point the logical client had replayed to.

Failover slots lift this limitation by replicating slots consistently to
physical standbys, keeping them up to date and using them in WAL
retention calculations. This allows a logical decoding client to follow
a physical failover and promotion without losing its place in the change
stream.

A failover slot may only be created on a master server, as it must be
able to write WAL. This limitation may be lifted later.

pg_basebackup is also modified to copy the contents of pg_replslot.
Non-failover slots will now be removed during backend startup instead
of being omitted from the copy.

This patch does not add any user interface for failover slots. There's
no way to create them from SQL or from the walsender. That and the
documentation for failover slots are in the next patch in the series
so that this patch is entirely focused on the implementation.

Craig Ringer, based on a prototype by Simon Riggs
---
 src/backend/access/rmgrdesc/Makefile               |   2 +-
 src/backend/access/rmgrdesc/replslotdesc.c         |  65 +++
 src/backend/access/transam/rmgr.c                  |   1 +
 src/backend/access/transam/xlog.c                  |  14 +-
 src/backend/commands/dbcommands.c                  |   3 +
 src/backend/replication/basebackup.c               |  12 -
 src/backend/replication/logical/decode.c           |   1 +
 src/backend/replication/logical/logical.c          |  25 +-
 src/backend/replication/slot.c                     | 586 +++++++++++++++++++--
 src/backend/replication/slotfuncs.c                |   4 +-
 src/backend/replication/walsender.c                |   8 +-
 src/bin/pg_xlogdump/replslotdesc.c                 |   1 +
 src/bin/pg_xlogdump/rmgrdesc.c                     |   1 +
 src/include/access/rmgrlist.h                      |   1 +
 src/include/replication/slot.h                     |  69 +--
 src/include/replication/slot_xlog.h                | 100 ++++
 .../modules/decoding_failover/decoding_failover.c  |   6 +-
 17 files changed, 766 insertions(+), 133 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/replslotdesc.c
 create mode 120000 src/bin/pg_xlogdump/replslotdesc.c
 create mode 100644 src/include/replication/slot_xlog.h

diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..600b544 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -10,7 +10,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
 	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
-	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
+	   replorigindesc.o replslotdesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/replslotdesc.c b/src/backend/access/rmgrdesc/replslotdesc.c
new file mode 100644
index 0000000..5829e8d
--- /dev/null
+++ b/src/backend/access/rmgrdesc/replslotdesc.c
@@ -0,0 +1,65 @@
+/*-------------------------------------------------------------------------
+ *
+ * replslotdesc.c
+ *	  rmgr descriptor routines for replication/slot.c
+ *
+ * Portions Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/replslotdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "replication/slot_xlog.h"
+
+void
+replslot_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			{
+				ReplicationSlotInWAL xlrec;
+
+				xlrec = (ReplicationSlotInWAL) rec;
+
+				appendStringInfo(buf, "of slot %s with restart %X/%X and xid %u confirmed to %X/%X",
+						NameStr(xlrec->name),
+						(uint32)(xlrec->restart_lsn>>32), (uint32)(xlrec->restart_lsn),
+						xlrec->xmin,
+						(uint32)(xlrec->confirmed_flush>>32), (uint32)(xlrec->confirmed_flush));
+
+				break;
+			}
+		case XLOG_REPLSLOT_DROP:
+			{
+				xl_replslot_drop *xlrec;
+
+				xlrec = (xl_replslot_drop *) rec;
+
+				appendStringInfo(buf, "of slot %s", NameStr(xlrec->name));
+
+				break;
+			}
+	}
+}
+
+const char *
+replslot_identify(uint8 info)
+{
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			return "UPDATE";
+		case XLOG_REPLSLOT_DROP:
+			return "DROP";
+		default:
+			return NULL;
+	}
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7c4d773..0bd5796 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -24,6 +24,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f70bb49..003610d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6348,8 +6348,11 @@ StartupXLOG(void)
 	/*
 	 * Initialize replication slots, before there's a chance to remove
 	 * required resources.
+	 *
+	 * If we're in archive recovery then non-failover slots are no
+	 * longer of any use and should be dropped during startup.
 	 */
-	StartupReplicationSlots();
+	StartupReplicationSlots(ArchiveRecoveryRequested);
 
 	/*
 	 * Startup logical state, needs to be setup now so we have proper data
@@ -8182,6 +8185,12 @@ CreateCheckPoint(int flags)
 	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
 
 	/*
+	 * Flush dirty replication slots before we block WAL writes, so
+	 * any failover slots get written out.
+	 */
+	CheckPointReplicationSlots();
+
+	/*
 	 * Prepare to accumulate statistics.
 	 *
 	 * Note: because it is possible for log_checkpoints to change while a
@@ -8622,7 +8631,6 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointMultiXact();
 	CheckPointPredicate();
 	CheckPointRelationMap();
-	CheckPointReplicationSlots();
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointBuffers(flags);	/* performs all required fsyncs */
@@ -8696,6 +8704,8 @@ CreateRestartPoint(int flags)
 	 */
 	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
 
+	CheckPointReplicationSlots();
+
 	/* Get a local copy of the last safe checkpoint record. */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr;
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c1c0223..61fc45b 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,9 @@ dbase_redo(XLogReaderState *record)
 		/* Clean out the xlog relcache too */
 		XLogDropDatabase(xlrec->db_id);
 
+		/* Drop any logical failover slots for this database */
+		ReplicationSlotsDropDBSlots(xlrec->db_id);
+
 		/* And remove the physical files */
 		if (!rmtree(dst_path, true))
 			ereport(WARNING,
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index af0fb09..ab1f271 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -973,18 +973,6 @@ sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces,
 		}
 
 		/*
-		 * Skip pg_replslot, not useful to copy. But include it as an empty
-		 * directory anyway, so we get permissions right.
-		 */
-		if (strcmp(de->d_name, "pg_replslot") == 0)
-		{
-			if (!sizeonly)
-				_tarWriteHeader(pathbuf + basepathlen + 1, NULL, &statbuf);
-			size += 512;		/* Size of the header just added */
-			continue;
-		}
-
-		/*
 		 * We can skip pg_xlog, the WAL segments need to be fetched from the
 		 * WAL archive anyway. But include it as an empty directory anyway, so
 		 * we get permissions right.
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 13af485..fb500e2 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -143,6 +143,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
+		case RM_REPLSLOT_ID:
 			/* just deal with xid, and done */
 			ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record),
 									buf.origptr);
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 40db6ff..d3fb1a5 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -85,16 +85,19 @@ CheckLogicalDecodingRequirements(void)
 				 errmsg("logical decoding requires a database connection")));
 
 	/* ----
-	 * TODO: We got to change that someday soon...
+	 * TODO: Allow logical decoding from a standby
 	 *
-	 * There's basically three things missing to allow this:
+	 * There are some things missing to allow this:
 	 * 1) We need to be able to correctly and quickly identify the timeline a
-	 *	  LSN belongs to
-	 * 2) We need to force hot_standby_feedback to be enabled at all times so
-	 *	  the primary cannot remove rows we need.
-	 * 3) support dropping replication slots referring to a database, in
-	 *	  dbase_redo. There can't be any active ones due to HS recovery
-	 *	  conflicts, so that should be relatively easy.
+	 *    LSN belongs to
+	 * 2) To prevent needed rows from being removed we would need
+	 *    to enhance hot_standby_feedback so it sends both xmin and
+	 *    catalog_xmin to the master.  A standby slot can't write WAL, so we
+	 *    wouldn't be able to use it directly for failover, without some very
+	 *    complex state interactions via master.
+	 *
+	 * So this doesn't seem likely to change anytime soon.
+	 *
 	 * ----
 	 */
 	if (RecoveryInProgress())
@@ -272,7 +275,7 @@ CreateInitDecodingContext(char *plugin,
 	slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
 	slot->data.catalog_xmin = slot->effective_catalog_xmin;
 
-	ReplicationSlotsComputeRequiredXmin(true);
+	ReplicationSlotsUpdateRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
 
@@ -920,8 +923,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 				MyReplicationSlot->effective_catalog_xmin = MyReplicationSlot->data.catalog_xmin;
 				SpinLockRelease(&MyReplicationSlot->mutex);
 
-				ReplicationSlotsComputeRequiredXmin(false);
-				ReplicationSlotsComputeRequiredLSN();
+				ReplicationSlotsUpdateRequiredXmin(false);
+				ReplicationSlotsUpdateRequiredLSN();
 			}
 		}
 	}
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index ead221d..fbfdc4d 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -24,7 +24,18 @@
  * directory. Inside that directory the state file will contain the slot's
  * own data. Additional data can be stored alongside that file if required.
  * While the server is running, the state data is also cached in memory for
- * efficiency.
+ * efficiency. Non-failover slots are NOT subject to WAL logging and may
+ * be used on standbys (though that's only supported for physical slots at
+ * the moment). They use tempfile writes and swaps for crash safety.
+ *
+ * A failover slot created on a master node generates WAL records that
+ * maintain a copy of the slot on standby nodes. If a standby node is
+ * promoted, the failover slot allows access to be restarted just as if
+ * the original master node was being accessed, allowing for the timeline
+ * change. The replica considers slot positions when removing WAL to make
+ * sure it can satisfy the needs of slots after promotion.  For logical
+ * decoding slots the slot's internal state is kept up to date so it's
+ * ready for use after promotion.
  *
  * ReplicationSlotAllocationLock must be taken in exclusive mode to allocate
  * or free a slot. ReplicationSlotControlLock must be taken in shared mode
@@ -44,6 +55,7 @@
 #include "common/string.h"
 #include "miscadmin.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -101,10 +113,14 @@ static LWLockTranche ReplSlotIOLWLockTranche;
 static void ReplicationSlotDropAcquired(void);
 
 /* internal persistency functions */
-static void RestoreSlotFromDisk(const char *name);
+static void RestoreSlotFromDisk(const char *name, bool drop_nonfailover_slots);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
 static void SaveSlotToPath(ReplicationSlot *slot, const char *path, int elevel);
 
+/* internal redo functions */
+static void ReplicationSlotRedoCreateOrUpdate(ReplicationSlotInWAL xlrec);
+static void ReplicationSlotRedoDrop(const char * slotname);
+
 /*
  * Report shared-memory space needed by ReplicationSlotShmemInit.
  */
@@ -220,7 +236,8 @@ ReplicationSlotValidateName(const char *name, int elevel)
  */
 void
 ReplicationSlotCreate(const char *name, bool db_specific,
-					  ReplicationSlotPersistency persistency)
+					  ReplicationSlotPersistency persistency,
+					  bool failover)
 {
 	ReplicationSlot *slot = NULL;
 	int			i;
@@ -229,6 +246,11 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 
 	ReplicationSlotValidateName(name, ERROR);
 
+	if (failover && RecoveryInProgress())
+		ereport(ERROR,
+				(errmsg("a failover slot may not be created on a replica"),
+				 errhint("Create the slot on the master server instead.")));
+
 	/*
 	 * If some other backend ran this code concurrently with us, we'd likely both
 	 * allocate the same slot, and that would be bad.  We'd also be at risk of
@@ -278,6 +300,9 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	StrNCpy(NameStr(slot->data.name), name, NAMEDATALEN);
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.restart_lsn = InvalidXLogRecPtr;
+	/* Slot timeline is unused and always zero */
+	slot->data.restart_tli = 0;
+	slot->data.failover = failover;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -313,6 +338,10 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 
 /*
  * Find a previously created slot and mark it as used by this backend.
+ *
+ * Sets active_pid and assigns MyReplicationSlot iff successfully acquired.
+ *
+ * ERRORs on an attempt to acquire a failover slot when in recovery.
  */
 void
 ReplicationSlotAcquire(const char *name)
@@ -335,7 +364,11 @@ ReplicationSlotAcquire(const char *name)
 		{
 			SpinLockAcquire(&s->mutex);
 			active_pid = s->active_pid;
-			if (active_pid == 0)
+			/*
+			 * We can only claim a slot for our use if it's not claimed
+			 * by someone else AND it isn't a failover slot on a standby.
+			 */
+			if (active_pid == 0 && !(RecoveryInProgress() && s->data.failover))
 				s->active_pid = MyProcPid;
 			SpinLockRelease(&s->mutex);
 			slot = s;
@@ -349,12 +382,24 @@ ReplicationSlotAcquire(const char *name)
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_OBJECT),
 				 errmsg("replication slot \"%s\" does not exist", name)));
+
 	if (active_pid != 0)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_IN_USE),
 			   errmsg("replication slot \"%s\" is active for PID %d",
 					  name, active_pid)));
 
+	/*
+	 * An attempt to use a failover slot from a standby must fail since
+	 * we can't write WAL from a standby and there's no sensible way
+	 * to advance slot position from both replica and master anyway.
+	 */
+	if (RecoveryInProgress() && slot->data.failover)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_IN_USE),
+				 errmsg("replication slot \"%s\" is reserved for use after failover",
+					  name)));
+
 	/* We made this slot active, so it's ours now. */
 	MyReplicationSlot = slot;
 }
@@ -411,6 +456,9 @@ ReplicationSlotDrop(const char *name)
 /*
  * Permanently drop the currently acquired replication slot which will be
  * released by the point this function returns.
+ *
+ * Callers must NOT hold ReplicationSlotControlLock in SHARED mode.  EXCLUSIVE
+ * is OK, or not held at all.
  */
 static void
 ReplicationSlotDropAcquired(void)
@@ -418,9 +466,14 @@ ReplicationSlotDropAcquired(void)
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
 	ReplicationSlot *slot = MyReplicationSlot;
+	bool slot_is_failover;
+	bool took_control_lock = false,
+		 took_allocation_lock = false;
 
 	Assert(MyReplicationSlot != NULL);
 
+	slot_is_failover = slot->data.failover;
+
 	/* slot isn't acquired anymore */
 	MyReplicationSlot = NULL;
 
@@ -428,8 +481,27 @@ ReplicationSlotDropAcquired(void)
 	 * If some other backend ran this code concurrently with us, we might try
 	 * to delete a slot with a certain name while someone else was trying to
 	 * create a slot with the same name.
+	 *
+	 * If called with the lock already held it MUST be held in
+	 * EXCLUSIVE mode.
 	 */
-	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+	if (!LWLockHeldByMe(ReplicationSlotAllocationLock))
+	{
+		took_allocation_lock = true;
+		LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+	}
+
+	/* Record the drop in XLOG if we aren't replaying WAL */
+	if (XLogInsertAllowed() && slot_is_failover)
+	{
+		xl_replslot_drop xlrec;
+
+		memcpy(&(xlrec.name), NameStr(slot->data.name), NAMEDATALEN);
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xlrec));
+		(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_DROP);
+	}
 
 	/* Generate pathnames. */
 	sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
@@ -459,7 +531,11 @@ ReplicationSlotDropAcquired(void)
 	}
 	else
 	{
-		bool		fail_softly = slot->data.persistency == RS_EPHEMERAL;
+		bool		fail_softly = false;
+
+		if (RecoveryInProgress() ||
+			slot->data.persistency == RS_EPHEMERAL)
+			fail_softly = true;
 
 		SpinLockAcquire(&slot->mutex);
 		slot->active_pid = 0;
@@ -477,18 +553,27 @@ ReplicationSlotDropAcquired(void)
 	 * grabbing the mutex because nobody else can be scanning the array here,
 	 * and nobody can be attached to this slot and thus access it without
 	 * scanning the array.
+	 *
+	 * You must hold the lock in EXCLUSIVE mode or not at all.
 	 */
-	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	if (!LWLockHeldByMe(ReplicationSlotControlLock))
+	{
+		took_control_lock = true;
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	}
+
 	slot->active_pid = 0;
 	slot->in_use = false;
-	LWLockRelease(ReplicationSlotControlLock);
+
+	if (took_control_lock)
+		LWLockRelease(ReplicationSlotControlLock);
 
 	/*
 	 * Slot is dead and doesn't prevent resource removal anymore, recompute
 	 * limits.
 	 */
-	ReplicationSlotsComputeRequiredXmin(false);
-	ReplicationSlotsComputeRequiredLSN();
+	ReplicationSlotsUpdateRequiredXmin(false);
+	ReplicationSlotsUpdateRequiredLSN();
 
 	/*
 	 * If removing the directory fails, the worst thing that will happen is
@@ -504,7 +589,8 @@ ReplicationSlotDropAcquired(void)
 	 * We release this at the very end, so that nobody starts trying to create
 	 * a slot while we're still cleaning up the detritus of the old one.
 	 */
-	LWLockRelease(ReplicationSlotAllocationLock);
+	if (took_allocation_lock)
+		LWLockRelease(ReplicationSlotAllocationLock);
 }
 
 /*
@@ -544,6 +630,9 @@ ReplicationSlotMarkDirty(void)
 /*
  * Convert a slot that's marked as RS_EPHEMERAL to a RS_PERSISTENT slot,
  * guaranteeing it will be there after an eventual crash.
+ *
+ * Failover slots will emit a create xlog record at this time, having
+ * not been previously written to xlog.
  */
 void
 ReplicationSlotPersist(void)
@@ -565,7 +654,7 @@ ReplicationSlotPersist(void)
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  */
 void
-ReplicationSlotsComputeRequiredXmin(bool already_locked)
+ReplicationSlotsUpdateRequiredXmin(bool already_locked)
 {
 	int			i;
 	TransactionId agg_xmin = InvalidTransactionId;
@@ -610,10 +699,20 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 }
 
 /*
- * Compute the oldest restart LSN across all slots and inform xlog module.
+ * Update the xlog module's copy of the minimum restart lsn across all slots
  */
 void
-ReplicationSlotsComputeRequiredLSN(void)
+ReplicationSlotsUpdateRequiredLSN(void)
+{
+	XLogSetReplicationSlotMinimumLSN(ReplicationSlotsComputeRequiredLSN(false));
+}
+
+/*
+ * Compute the oldest restart LSN across all slots (or optionally
+ * only failover slots) and return it.
+ */
+XLogRecPtr
+ReplicationSlotsComputeRequiredLSN(bool failover_only)
 {
 	int			i;
 	XLogRecPtr	min_required = InvalidXLogRecPtr;
@@ -625,14 +724,19 @@ ReplicationSlotsComputeRequiredLSN(void)
 	{
 		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
 		XLogRecPtr	restart_lsn;
+		bool		failover;
 
 		if (!s->in_use)
 			continue;
 
 		SpinLockAcquire(&s->mutex);
 		restart_lsn = s->data.restart_lsn;
+		failover = s->data.failover;
 		SpinLockRelease(&s->mutex);
 
+		if (failover_only && !failover)
+			continue;
+
 		if (restart_lsn != InvalidXLogRecPtr &&
 			(min_required == InvalidXLogRecPtr ||
 			 restart_lsn < min_required))
@@ -640,7 +744,7 @@ ReplicationSlotsComputeRequiredLSN(void)
 	}
 	LWLockRelease(ReplicationSlotControlLock);
 
-	XLogSetReplicationSlotMinimumLSN(min_required);
+	return min_required;
 }
 
 /*
@@ -649,7 +753,7 @@ ReplicationSlotsComputeRequiredLSN(void)
  * Returns InvalidXLogRecPtr if logical decoding is disabled or no logical
  * slots exist.
  *
- * NB: this returns a value >= ReplicationSlotsComputeRequiredLSN(), since it
+ * NB: this returns a value >= ReplicationSlotsUpdateRequiredLSN(), since it
  * ignores physical replication slots.
  *
  * The results aren't required frequently, so we don't maintain a precomputed
@@ -747,6 +851,45 @@ ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive)
 	return false;
 }
 
+void
+ReplicationSlotsDropDBSlots(Oid dboid)
+{
+	int			i;
+
+	Assert(MyReplicationSlot == NULL);
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		if (s->in_use && s->data.database == dboid)
+		{
+			/*
+			 * There should be no connections to this dbid
+			 * therefore all slots for this dbid should be
+			 * logical, inactive failover slots.
+			 */
+			Assert(s->active_pid == 0);
+			Assert(s->in_use);
+			Assert(SlotIsLogical(s));
+
+			/*
+			 * Acquire the replication slot
+			 */
+			MyReplicationSlot = s;
+
+			/*
+			 * No need to deactivate slot, especially since we
+			 * already hold ReplicationSlotControlLock.
+			 */
+			ReplicationSlotDropAcquired();
+		}
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	MyReplicationSlot = NULL;
+}
 
 /*
  * Check whether the server's configuration supports using replication
@@ -779,12 +922,13 @@ ReplicationSlotReserveWal(void)
 
 	Assert(slot != NULL);
 	Assert(slot->data.restart_lsn == InvalidXLogRecPtr);
+	Assert(slot->data.restart_tli == 0);
 
 	/*
 	 * The replication slot mechanism is used to prevent removal of required
 	 * WAL. As there is no interlock between this routine and checkpoints, WAL
 	 * segments could concurrently be removed when a now stale return value of
-	 * ReplicationSlotsComputeRequiredLSN() is used. In the unlikely case that
+	 * ReplicationSlotsUpdateRequiredLSN() is used. In the unlikely case that
 	 * this happens we'll just retry.
 	 */
 	while (true)
@@ -821,12 +965,12 @@ ReplicationSlotReserveWal(void)
 		}
 
 		/* prevent WAL removal as fast as possible */
-		ReplicationSlotsComputeRequiredLSN();
+		ReplicationSlotsUpdateRequiredLSN();
 
 		/*
 		 * If all required WAL is still there, great, otherwise retry. The
 		 * slot should prevent further removal of WAL, unless there's a
-		 * concurrent ReplicationSlotsComputeRequiredLSN() after we've written
+		 * concurrent ReplicationSlotsUpdateRequiredLSN() after we've written
 		 * the new restart_lsn above, so normally we should never need to loop
 		 * more than twice.
 		 */
@@ -878,7 +1022,7 @@ CheckPointReplicationSlots(void)
  * needs to be run before we start crash recovery.
  */
 void
-StartupReplicationSlots(void)
+StartupReplicationSlots(bool drop_nonfailover_slots)
 {
 	DIR		   *replication_dir;
 	struct dirent *replication_de;
@@ -917,7 +1061,7 @@ StartupReplicationSlots(void)
 		}
 
 		/* looks like a slot in a normal state, restore */
-		RestoreSlotFromDisk(replication_de->d_name);
+		RestoreSlotFromDisk(replication_de->d_name, drop_nonfailover_slots);
 	}
 	FreeDir(replication_dir);
 
@@ -926,8 +1070,8 @@ StartupReplicationSlots(void)
 		return;
 
 	/* Now that we have recovered all the data, compute replication xmin */
-	ReplicationSlotsComputeRequiredXmin(false);
-	ReplicationSlotsComputeRequiredLSN();
+	ReplicationSlotsUpdateRequiredXmin(false);
+	ReplicationSlotsUpdateRequiredLSN();
 }
 
 /* ----
@@ -996,6 +1140,8 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 
 /*
  * Shared functionality between saving and creating a replication slot.
+ *
+ * For failover slots this is where we emit xlog.
  */
 static void
 SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
@@ -1006,15 +1152,18 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	ReplicationSlotOnDisk cp;
 	bool		was_dirty;
 
-	/* first check whether there's something to write out */
-	SpinLockAcquire(&slot->mutex);
-	was_dirty = slot->dirty;
-	slot->just_dirtied = false;
-	SpinLockRelease(&slot->mutex);
+	if (!RecoveryInProgress())
+	{
+		/* first check whether there's something to write out */
+		SpinLockAcquire(&slot->mutex);
+		was_dirty = slot->dirty;
+		slot->just_dirtied = false;
+		SpinLockRelease(&slot->mutex);
 
-	/* and don't do anything if there's nothing to write */
-	if (!was_dirty)
-		return;
+		/* and don't do anything if there's nothing to write */
+		if (!was_dirty)
+			return;
+	}
 
 	LWLockAcquire(&slot->io_in_progress_lock, LW_EXCLUSIVE);
 
@@ -1047,6 +1196,25 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	SpinLockRelease(&slot->mutex);
 
+	/*
+	 * If needed, record this action in WAL
+	 */
+	if (slot->data.failover &&
+		slot->data.persistency == RS_PERSISTENT &&
+		!RecoveryInProgress())
+	{
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&cp.slotdata), sizeof(ReplicationSlotPersistentData));
+		/*
+		 * Note that slot creation on the downstream is also an "update".
+		 *
+		 * Slots can start off ephemeral and be updated to persistent. We just
+		 * log the update and the downstream creates the new slot if it doesn't
+		 * exist yet.
+		 */
+		(void) XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_UPDATE);
+	}
+
 	COMP_CRC32C(cp.checksum,
 				(char *) (&cp) + SnapBuildOnDiskNotChecksummedSize,
 				SnapBuildOnDiskChecksummedSize);
@@ -1116,7 +1284,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
  * Load a single slot from disk into memory.
  */
 static void
-RestoreSlotFromDisk(const char *name)
+RestoreSlotFromDisk(const char *name, bool drop_nonfailover_slots)
 {
 	ReplicationSlotOnDisk cp;
 	int			i;
@@ -1235,10 +1403,21 @@ RestoreSlotFromDisk(const char *name)
 						path, checksum, cp.checksum)));
 
 	/*
-	 * If we crashed with an ephemeral slot active, don't restore but delete
-	 * it.
+	 * If we crashed with an ephemeral slot active, don't restore but
+	 * delete it.
+	 *
+	 * Similarly, if we're in archive recovery and will be running as
+	 * a standby (when drop_nonfailover_slots is set), non-failover
+	 * slots can't be relied upon. Logical slots might have a catalog
+	 * xmin lower than reality because the original slot on the master
+	 * advanced past the point the stale slot on the replica is stuck
+	 * at. Additionally slots might have been copied while being
+	 * written to if the basebackup copy method was not atomic.
+	 * Failover slots are safe since they're WAL-logged and follow the
+	 * master's slot position.
 	 */
-	if (cp.slotdata.persistency != RS_PERSISTENT)
+	if (cp.slotdata.persistency != RS_PERSISTENT
+			|| (drop_nonfailover_slots && !cp.slotdata.failover))
 	{
 		sprintf(path, "pg_replslot/%s", name);
 
@@ -1249,6 +1428,14 @@ RestoreSlotFromDisk(const char *name)
 					 errmsg("could not remove directory \"%s\"", path)));
 		}
 		fsync_fname("pg_replslot", true);
+
+		if (cp.slotdata.persistency == RS_PERSISTENT)
+		{
+			ereport(LOG,
+					(errmsg("dropped non-failover slot \"%s\" during archive recovery",
+							 NameStr(cp.slotdata.name))));
+		}
+
 		return;
 	}
 
@@ -1285,5 +1472,332 @@ RestoreSlotFromDisk(const char *name)
 	if (!restored)
 		ereport(PANIC,
 				(errmsg("too many replication slots active before shutdown"),
-				 errhint("Increase max_replication_slots and try again.")));
+				 errhint("Increase max_replication_slots (currently %d) and try again.",
+					 max_replication_slots)));
+}
+
+/*
+ * This usually just writes new persistent data to the slot state, but an
+ * update record might create a new slot on the downstream if we changed a
+ * previously ephemeral slot to persistent. We have to decide which
+ * by looking for the existing slot.
+ */
+static void
+ReplicationSlotRedoCreateOrUpdate(ReplicationSlotInWAL xlrec)
+{
+	ReplicationSlot *slot;
+	bool	found_available = false;
+	bool	found_duplicate = false;
+	int		use_slotid = 0;
+	int		i;
+
+	/*
+	 * We're in redo, but someone could still create a local
+	 * non-failover slot and race with us unless we take the
+	 * allocation lock.
+	 */
+	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		slot = &ReplicationSlotCtl->replication_slots[i];
+
+		/*
+		 * Find first unused position in the slots array, but keep on
+		 * scanning in case there's an existing slot with the same
+		 * name.
+		 */
+		if (!slot->in_use && !found_available)
+		{
+			use_slotid = i;
+			found_available = true;
+		}
+
+		/*
+		 * Existing slot with same name? It could be our failover slot
+		 * to update or a non-failover slot with a conflicting name.
+		 */
+		if (strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) == 0)
+		{
+			use_slotid = i;
+			found_available = true;
+			found_duplicate = true;
+			break;
+		}
+	}
+
+	if (found_duplicate && !slot->data.failover)
+	{
+		/*
+		 * A local non-failover slot exists with the same name as
+		 * the failover slot we're creating.
+		 *
+		 * Clobber the client, drop its slot, and carry on with
+		 * our business.
+		 *
+		 * First we must temporarily release the allocation lock while
+		 * we try to terminate the process that holds the slot, since
+		 * we don't want to hold the LWlock for ages. We'll reacquire
+		 * it later.
+		 */
+		LWLockRelease(ReplicationSlotAllocationLock);
+
+		/* We might race with other clients, so retry-loop */
+		do
+		{
+			int active_pid = slot->active_pid;
+			int max_sleep_millis = 120 * 1000;
+			int millis_per_sleep = 1000;
+
+			if (active_pid != 0)
+			{
+				ereport(INFO,
+						(errmsg("terminating active connection by pid %u to local slot \"%s\" because of conflict with recovery",
+							active_pid, NameStr(slot->data.name))));
+
+				if (kill(active_pid, SIGTERM))
+					elog(DEBUG1, "failed to signal pid %u to terminate on slot conflict: %m",
+							active_pid);
+
+				/*
+				 * Wait for the process using the slot to die. This just uses the
+				 * latch to poll; the process won't set our latch when it releases
+				 * the slot and dies.
+				 *
+				 * We're checking active_pid without any locks held, but we'll
+				 * recheck when we attempt to drop the slot.
+				 */
+				while (slot->in_use && slot->active_pid == active_pid
+						&& max_sleep_millis > 0)
+				{
+					int rc;
+
+					rc = WaitLatch(MyLatch,
+							WL_TIMEOUT | WL_LATCH_SET | WL_POSTMASTER_DEATH,
+							millis_per_sleep);
+
+					if (rc & WL_POSTMASTER_DEATH)
+						elog(FATAL, "exiting after postmaster termination");
+
+					/*
+					 * Might be shorter if something sets our latch, but
+					 * we don't care much.
+					 */
+					max_sleep_millis -= millis_per_sleep;
+				}
+
+				if (max_sleep_millis <= 0)
+					elog(WARNING, "process %u is taking too long to terminate after SIGTERM",
+							active_pid);
+			}
+
+			if (slot->active_pid == 0)
+			{
+				/* Try to acquire and drop the slot */
+				SpinLockAcquire(&slot->mutex);
+
+				if (slot->active_pid != 0)
+				{
+					/* Lost the race, go around */
+				}
+				else
+				{
+					/* Claim the slot for ourselves */
+					slot->active_pid = MyProcPid;
+					MyReplicationSlot = slot;
+				}
+				SpinLockRelease(&slot->mutex);
+			}
+
+			if (slot->active_pid == MyProcPid)
+			{
+				NameData slotname;
+				strncpy(NameStr(slotname), NameStr(slot->data.name), NAMEDATALEN);
+				(NameStr(slotname))[NAMEDATALEN-1] = '\0';
+
+				/*
+				 * Reclaim the allocation lock and THEN drop the slot,
+				 * so nobody else can grab the name until we've
+				 * finished redo.
+				 */
+				LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
+				ReplicationSlotDropAcquired();
+				/* We clobbered the duplicate, treat it as new */
+				found_duplicate = false;
+
+				ereport(WARNING,
+						(errmsg("dropped local replication slot \"%s\" because of conflict with recovery",
+								NameStr(slotname)),
+						 errdetail("A failover slot with the same name was created on the master server.")));
+			}
+		}
+		while (slot->in_use);
+	}
+
+	Assert(LWLockHeldByMe(ReplicationSlotAllocationLock));
+
+	/*
+	 * This is either an empty slot control position to make a new slot or it's
+	 * an existing entry for this failover slot that we need to update.
+	 */
+	if (found_available)
+	{
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
+
+		slot = &ReplicationSlotCtl->replication_slots[use_slotid];
+
+		/* restore the entire set of persistent data */
+		memcpy(&slot->data, xlrec,
+			   sizeof(ReplicationSlotPersistentData));
+
+		Assert(strcmp(NameStr(xlrec->name), NameStr(slot->data.name)) == 0);
+		Assert(slot->data.failover && slot->data.persistency == RS_PERSISTENT);
+
+		/* Update the non-persistent in-memory state */
+		slot->effective_xmin = xlrec->xmin;
+		slot->effective_catalog_xmin = xlrec->catalog_xmin;
+
+		if (found_duplicate)
+		{
+			char		path[MAXPGPATH];
+
+			/* Write an existing slot to disk */
+			Assert(slot->in_use);
+			Assert(slot->active_pid == 0); /* can't be replaying from failover slot */
+
+			sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
+			slot->dirty = true;
+			SaveSlotToPath(slot, path, ERROR);
+		}
+		else
+		{
+			Assert(!slot->in_use);
+			/* In-memory state that's only set on create, not update */
+			slot->active_pid = 0;
+			slot->in_use = true;
+			slot->candidate_catalog_xmin = InvalidTransactionId;
+			slot->candidate_xmin_lsn = InvalidXLogRecPtr;
+			slot->candidate_restart_lsn = InvalidXLogRecPtr;
+			slot->candidate_restart_valid = InvalidXLogRecPtr;
+
+			CreateSlotOnDisk(slot);
+		}
+
+		LWLockRelease(ReplicationSlotControlLock);
+
+		ReplicationSlotsUpdateRequiredXmin(false);
+		ReplicationSlotsUpdateRequiredLSN();
+	}
+
+	LWLockRelease(ReplicationSlotAllocationLock);
+
+	if (!found_available)
+	{
+		/*
+		 * Because the standby should have the same or greater max_replication_slots
+		 * as the master, this shouldn't happen, but just in case...
+		 */
+		ereport(ERROR,
+				(errmsg("max_replication_slots exceeded, cannot replay failover slot creation"),
+				 errhint("Increase max_replication_slots.")));
+	}
+}
+
+/*
+ * Redo a slot drop of a failover slot. This might be a redo during crash
+ * recovery on the master or it may be replay on a standby.
+ */
+static void
+ReplicationSlotRedoDrop(const char * slotname)
+{
+	/*
+	 * Acquire the failover slot that's to be dropped.
+	 *
+	 * We can't ReplicationSlotAcquire here because we want to acquire
+	 * a replication slot during replay, which isn't usually allowed.
+	 * Also, because we might crash midway through a drop we can't
+	 * assume we'll actually find the slot so it's not an error for
+	 * the slot to be missing.
+	 */
+	int			i;
+
+	Assert(MyReplicationSlot == NULL);
+
+	ReplicationSlotValidateName(slotname, ERROR);
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		if (s->in_use && strcmp(slotname, NameStr(s->data.name)) == 0)
+		{
+			if (s->data.persistency != RS_PERSISTENT)
+			{
+				/* shouldn't happen */
+				elog(WARNING, "found conflicting non-persistent slot during failover slot drop");
+				break;
+			}
+
+			if (!s->data.failover)
+			{
+				/* shouldn't happen */
+				elog(WARNING, "found non-failover slot during redo of slot drop");
+				break;
+			}
+
+			/* A failover slot can't be active during recovery */
+			Assert(s->active_pid == 0);
+
+			/* Claim the slot */
+			s->active_pid = MyProcPid;
+			MyReplicationSlot = s;
+
+			break;
+		}
+	}
+	LWLockRelease(ReplicationSlotControlLock);
+
+	if (MyReplicationSlot != NULL)
+	{
+		ReplicationSlotDropAcquired();
+	}
+	else
+	{
+		elog(WARNING, "failover slot \"%s\" not found during redo of drop",
+				slotname);
+	}
+}
+
+void
+replslot_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		/*
+		 * Update the values for an existing failover slot or, when a slot
+		 * is first logged as persistent, create it on the downstream.
+		 */
+		case XLOG_REPLSLOT_UPDATE:
+			ReplicationSlotRedoCreateOrUpdate((ReplicationSlotInWAL) XLogRecGetData(record));
+			break;
+
+		/*
+		 * Drop an existing failover slot.
+		 */
+		case XLOG_REPLSLOT_DROP:
+			{
+				xl_replslot_drop *xlrec =
+				(xl_replslot_drop *) XLogRecGetData(record);
+
+				ReplicationSlotRedoDrop(NameStr(xlrec->name));
+
+				break;
+			}
+
+		default:
+			elog(PANIC, "replslot_redo: unknown op code %u", info);
+	}
 }
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 9cc24ea..f430714 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -57,7 +57,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 	CheckSlotRequirements();
 
 	/* acquire replication slot, this will check for conflicting names */
-	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT);
+	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT, false);
 
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	nulls[0] = false;
@@ -120,7 +120,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	 * errors during initialization because it'll get dropped if this
 	 * transaction fails. We'll make it persistent at the end.
 	 */
-	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL);
+	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL, false);
 
 	/*
 	 * Create logical decoding context, to build the initial snapshot.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index f98475c..53de576 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -794,7 +794,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	if (cmd->kind == REPLICATION_KIND_PHYSICAL)
 	{
-		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT);
+		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT, false);
 	}
 	else
 	{
@@ -805,7 +805,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * handle errors during initialization because it'll get dropped if
 		 * this transaction fails. We'll make it persistent at the end.
 		 */
-		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL);
+		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL, false);
 	}
 
 	initStringInfo(&output_message);
@@ -1525,7 +1525,7 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	if (changed)
 	{
 		ReplicationSlotMarkDirty();
-		ReplicationSlotsComputeRequiredLSN();
+		ReplicationSlotsUpdateRequiredLSN();
 	}
 
 	/*
@@ -1621,7 +1621,7 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin)
 	if (changed)
 	{
 		ReplicationSlotMarkDirty();
-		ReplicationSlotsComputeRequiredXmin(false);
+		ReplicationSlotsUpdateRequiredXmin(false);
 	}
 }
 
diff --git a/src/bin/pg_xlogdump/replslotdesc.c b/src/bin/pg_xlogdump/replslotdesc.c
new file mode 120000
index 0000000..2e088d2
--- /dev/null
+++ b/src/bin/pg_xlogdump/replslotdesc.c
@@ -0,0 +1 @@
+../../../src/backend/access/rmgrdesc/replslotdesc.c
\ No newline at end of file
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index f9cd395..73ed7d4 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -26,6 +26,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
+#include "replication/slot_xlog.h"
 #include "rmgrdesc.h"
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index fab912d..124b7e5 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -45,3 +45,4 @@ PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_start
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
+PG_RMGR(RM_REPLSLOT_ID, "ReplicationSlot", replslot_redo, replslot_desc, replslot_identify, NULL, NULL)
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8be8ab6..cdcbd37 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -11,69 +11,12 @@
 
 #include "fmgr.h"
 #include "access/xlog.h"
-#include "access/xlogreader.h"
+#include "replication/slot_xlog.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
 
 /*
- * Behaviour of replication slots, upon release or crash.
- *
- * Slots marked as PERSISTENT are crashsafe and will not be dropped when
- * released. Slots marked as EPHEMERAL will be dropped when released or after
- * restarts.
- *
- * EPHEMERAL slots can be made PERSISTENT by calling ReplicationSlotPersist().
- */
-typedef enum ReplicationSlotPersistency
-{
-	RS_PERSISTENT,
-	RS_EPHEMERAL
-} ReplicationSlotPersistency;
-
-/*
- * On-Disk data of a replication slot, preserved across restarts.
- */
-typedef struct ReplicationSlotPersistentData
-{
-	/* The slot's identifier */
-	NameData	name;
-
-	/* database the slot is active on */
-	Oid			database;
-
-	/*
-	 * The slot's behaviour when being dropped (or restored after a crash).
-	 */
-	ReplicationSlotPersistency persistency;
-
-	/*
-	 * xmin horizon for data
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId xmin;
-
-	/*
-	 * xmin horizon for catalog tuples
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId catalog_xmin;
-
-	/* oldest LSN that might be required by this replication slot */
-	XLogRecPtr	restart_lsn;
-
-	/* oldest LSN that the client has acked receipt for */
-	XLogRecPtr	confirmed_flush;
-
-	/* plugin name */
-	NameData	plugin;
-} ReplicationSlotPersistentData;
-
-/*
  * Shared memory state of a single replication slot.
  */
 typedef struct ReplicationSlot
@@ -155,7 +98,7 @@ extern void ReplicationSlotsShmemInit(void);
 
 /* management of individual slots */
 extern void ReplicationSlotCreate(const char *name, bool db_specific,
-					  ReplicationSlotPersistency p);
+					  ReplicationSlotPersistency p, bool failover);
 extern void ReplicationSlotPersist(void);
 extern void ReplicationSlotDrop(const char *name);
 
@@ -167,12 +110,14 @@ extern void ReplicationSlotMarkDirty(void);
 /* misc stuff */
 extern bool ReplicationSlotValidateName(const char *name, int elevel);
 extern void ReplicationSlotReserveWal(void);
-extern void ReplicationSlotsComputeRequiredXmin(bool already_locked);
-extern void ReplicationSlotsComputeRequiredLSN(void);
+extern void ReplicationSlotsUpdateRequiredXmin(bool already_locked);
+extern void ReplicationSlotsUpdateRequiredLSN(void);
 extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void);
+extern XLogRecPtr ReplicationSlotsComputeRequiredLSN(bool failover_only);
 extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
+extern void ReplicationSlotsDropDBSlots(Oid dboid);
 
-extern void StartupReplicationSlots(void);
+extern void StartupReplicationSlots(bool drop_nonfailover_slots);
 extern void CheckPointReplicationSlots(void);
 
 extern void CheckSlotRequirements(void);
diff --git a/src/include/replication/slot_xlog.h b/src/include/replication/slot_xlog.h
new file mode 100644
index 0000000..e3211f5
--- /dev/null
+++ b/src/include/replication/slot_xlog.h
@@ -0,0 +1,100 @@
+/*-------------------------------------------------------------------------
+ * slot_xlog.h
+ *	   Replication slot management.
+ *
+ * Copyright (c) 2012-2016, PostgreSQL Global Development Group
+ *
+ * src/include/replication/slot_xlog.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef SLOT_XLOG_H
+#define SLOT_XLOG_H
+
+#include "fmgr.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogreader.h"
+
+/*
+ * Behaviour of replication slots, upon release or crash.
+ *
+ * Slots marked as PERSISTENT are crashsafe and will not be dropped when
+ * released. Slots marked as EPHEMERAL will be dropped when released or after
+ * restarts.
+ *
+ * EPHEMERAL slots can be made PERSISTENT by calling ReplicationSlotPersist().
+ */
+typedef enum ReplicationSlotPersistency
+{
+	RS_PERSISTENT,
+	RS_EPHEMERAL
+} ReplicationSlotPersistency;
+
+/*
+ * On-Disk data of a replication slot, preserved across restarts.
+ */
+typedef struct ReplicationSlotPersistentData
+{
+	/* The slot's identifier */
+	NameData	name;
+
+	/* database the slot is active on */
+	Oid			database;
+
+	/*
+	 * The slot's behaviour when being dropped (or restored after a crash).
+	 */
+	ReplicationSlotPersistency persistency;
+
+	/*
+	 * Slots created on master become failover-slots and are maintained
+	 * on all standbys, but are only assignable after failover.
+	 */
+	bool		failover;
+
+	/*
+	 * xmin horizon for data
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId xmin;
+
+	/*
+	 * xmin horizon for catalog tuples
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId catalog_xmin;
+
+	/* oldest LSN that might be required by this replication slot */
+	XLogRecPtr	restart_lsn;
+	TimeLineID	restart_tli;
+
+	/* oldest LSN that the client has acked receipt for */
+	XLogRecPtr	confirmed_flush;
+
+	/* plugin name */
+	NameData	plugin;
+} ReplicationSlotPersistentData;
+
+typedef ReplicationSlotPersistentData *ReplicationSlotInWAL;
+
+/*
+ * WAL records for failover slots
+ */
+#define XLOG_REPLSLOT_UPDATE	0x10
+#define XLOG_REPLSLOT_DROP		0x20
+
+typedef struct xl_replslot_drop
+{
+	NameData	name;
+} xl_replslot_drop;
+
+/* WAL logging */
+extern void replslot_redo(XLogReaderState *record);
+extern void replslot_desc(StringInfo buf, XLogReaderState *record);
+extern const char *replslot_identify(uint8 info);
+
+#endif   /* SLOT_XLOG_H */
diff --git a/src/test/modules/decoding_failover/decoding_failover.c b/src/test/modules/decoding_failover/decoding_failover.c
index bab0f3b..8fcfda5 100644
--- a/src/test/modules/decoding_failover/decoding_failover.c
+++ b/src/test/modules/decoding_failover/decoding_failover.c
@@ -37,7 +37,7 @@ decoding_failover_create_logical_slot(PG_FUNCTION_ARGS)
 
 	CheckSlotRequirements();
 
-	ReplicationSlotCreate(slotname, true, RS_PERSISTENT);
+	ReplicationSlotCreate(slotname, true, RS_PERSISTENT, false);
 
 	/* register the plugin name with the slot */
 	StrNCpy(NameStr(MyReplicationSlot->data.plugin), plugin, NAMEDATALEN);
@@ -99,8 +99,8 @@ decoding_failover_advance_logical_slot(PG_FUNCTION_ARGS)
 	ReplicationSlotSave();
 	ReplicationSlotRelease();
 
-	ReplicationSlotsComputeRequiredXmin(false);
-	ReplicationSlotsComputeRequiredLSN();
+	ReplicationSlotsUpdateRequiredXmin(false);
+	ReplicationSlotsUpdateRequiredLSN();
 
 	PG_RETURN_VOID();
 }
-- 
2.1.0

Attachment: 0002-Update-decoding_failover-tests-for-failover-slots.patch (text/x-patch; charset=US-ASCII)
From 1754994130106bbf2a024e129fbe0fc4818e737a Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 8 Mar 2016 14:34:36 +0800
Subject: [PATCH 2/7] Update decoding_failover tests for failover slots

---
 .../recovery/t/006_logical_decoding_timelines.pl   | 29 +++++++++-------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/src/test/recovery/t/006_logical_decoding_timelines.pl b/src/test/recovery/t/006_logical_decoding_timelines.pl
index 1372d90..ed6cac7 100644
--- a/src/test/recovery/t/006_logical_decoding_timelines.pl
+++ b/src/test/recovery/t/006_logical_decoding_timelines.pl
@@ -19,7 +19,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 20;
+use Test::More tests => 19;
 use RecursiveCopy;
 use File::Copy;
 
@@ -64,7 +64,7 @@ $node_master->safe_psql('postgres', 'CHECKPOINT;');
 
 # Verify that only the before base_backup slot is on the replica
 $stdout = $node_replica->safe_psql('postgres', 'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
-is($stdout, 'before_basebackup', 'Expected to find only slot before_basebackup on replica');
+is($stdout, '', 'Expected to find no slots on replica');
 
 # Boom, crash
 $node_master->stop('immediate');
@@ -86,22 +86,16 @@ like(
 	qr/replication slot "after_basebackup" does not exist/,
 	'after_basebackup slot missing');
 
-# Should be able to read from slot created before base backup
+# before_basebackup is gone too, since non-failover slots are dropped on the replica
 ($ret, $stdout, $stderr) = $node_replica->psql(
 	'postgres',
 "SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');",
 	timeout => 30);
-is($ret, 0, 'replay from slot before_basebackup succeeds');
-is( $stdout, q(BEGIN
-table public.decoding: INSERT: blah[text]:'beforebb'
-COMMIT
-BEGIN
-table public.decoding: INSERT: blah[text]:'afterbb'
-COMMIT
-BEGIN
-table public.decoding: INSERT: blah[text]:'after failover'
-COMMIT), 'decoded expected data from slot before_basebackup');
-is($stderr, '', 'replay from slot before_basebackup produces no stderr');
+is($ret, 3, 'replaying from before_basebackup slot fails');
+like(
+	$stderr,
+	qr/replication slot "before_basebackup" does not exist/,
+	'before_basebackup slot missing');
 
 # We don't need the standby anymore
 $node_replica->teardown_node();
@@ -121,9 +115,10 @@ is($node_master->psql('postgres', 'SELECT pg_drop_replication_slot(slot_name) FR
   0, 'dropping slots succeeds via pg_drop_replication_slot');
 
 # Same as before, we'll make one slot before basebackup, one after. This time
-# the basebackup will be with pg_basebackup so it'll omit both slots, then
-# we'll use SQL functions provided by the decoding_failover test module to
-# sync them to the replica, do some work, sync them and fail over then test
+# the basebackup will be with pg_basebackup. It'll copy the before_basebackup slot
+# but since it's a non-failover slot, the server will drop it immediately.
+# We'll use SQL functions provided by the decoding_failover test module to
+# sync both slots to the replica, do some work, sync them again, fail over, then test
 # again. This time we should have both the before- and after-basebackup
 # slots working.
 
-- 
2.1.0

0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patchtext/x-patch; charset=US-ASCII; name=0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patchDownload
From db40118d81959e583bf3c0a1964a52470b849edb Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 23 Feb 2016 16:00:09 +0800
Subject: [PATCH 3/7] Retain extra WAL for failover slots in base backups

Change the return value of pg_start_backup(), the BASE_BACKUP walsender
command, etc. to report the minimum WAL LSN required by any failover slot
when that is lower than the redo position, so that base backups contain
the WAL required for slots to work.

Add a new backup label entry 'MIN FAILOVER SLOT LSN' that, if present,
indicates the minimum LSN needed by any failover slot that is present in
the base backup. Backup tools should check for this entry and ensure
they retain all xlog segments from that point onwards.
---
 src/backend/access/transam/xlog.c | 41 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 003610d..5df2a59 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9774,6 +9774,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	bool		backup_started_in_recovery = false;
 	XLogRecPtr	checkpointloc;
 	XLogRecPtr	startpoint;
+	XLogRecPtr  slot_startpoint = InvalidXLogRecPtr;
 	TimeLineID	starttli;
 	pg_time_t	stamp_time;
 	char		strfbuf[128];
@@ -9920,6 +9921,17 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 			checkpointfpw = ControlFile->checkPointCopy.fullPageWrites;
 			LWLockRelease(ControlFileLock);
 
+			/*
+			 * If failover slots are in use we must retain and transfer WAL
+			 * older than the redo location so that those slots can be replayed
+			 * from after a failover event.
+			 *
+			 * This MUST fall on an xlog segment boundary, so truncate the LSN
+			 * appropriately.
+			 */
+			if (max_replication_slots > 0)
+				slot_startpoint = (ReplicationSlotsComputeRequiredLSN(true) / XLOG_SEG_SIZE) * XLOG_SEG_SIZE;
+
 			if (backup_started_in_recovery)
 			{
 				XLogRecPtr	recptr;
@@ -10088,6 +10100,10 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 						 backup_started_in_recovery ? "standby" : "master");
 		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
 		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
+		if (slot_startpoint != InvalidXLogRecPtr)
+			appendStringInfo(&labelfbuf, "MIN FAILOVER SLOT LSN: %X/%X\n",
+						(uint32) (slot_startpoint >> 32), (uint32) slot_startpoint);
+
 
 		/*
 		 * Okay, write the file, or return its contents to caller.
@@ -10181,9 +10197,34 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 
 	/*
 	 * We're done.  As a convenience, return the starting WAL location.
+	 *
+	 * pg_basebackup etc expect to use this as the position to start copying
+	 * WAL from, so we should return the minimum of the slot start LSN and the
+	 * current redo position to make sure we get all WAL required by failover
+	 * slots.
+	 *
+	 * The min required LSN for failover slots is also available from the
+	 * 'MIN FAILOVER SLOT LSN' entry in the backup label file.
 	 */
+	if (slot_startpoint != InvalidXLogRecPtr && slot_startpoint < startpoint)
+	{
+		List *history;
+		TimeLineID slot_start_tli;
+
+		/* Min LSN required by a slot may be on an older timeline. */
+		history = readTimeLineHistory(ThisTimeLineID);
+		slot_start_tli = tliOfPointInHistory(slot_startpoint, history);
+		list_free_deep(history);
+
+		if (slot_start_tli < starttli)
+			starttli = slot_start_tli;
+
+		startpoint = slot_startpoint;
+	}
+
 	if (starttli_p)
 		*starttli_p = starttli;
+
 	return startpoint;
 }
 
-- 
2.1.0

0004-Add-the-UI-and-for-failover-slots.patchtext/x-patch; charset=US-ASCII; name=0004-Add-the-UI-and-for-failover-slots.patchDownload
From 7463639c587176a1dbf8cc6eda33e92cd345b05c Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 23 Feb 2016 16:04:05 +0800
Subject: [PATCH 4/7] Add the UI for failover slots

Expose failover slots to the user.

Add a new 'failover' argument to pg_create_logical_replication_slot and
pg_create_physical_replication_slot. Accept a new FAILOVER keyword
argument in the CREATE_REPLICATION_SLOT walsender protocol command.
---
 contrib/test_decoding/expected/ddl.out |  3 +++
 contrib/test_decoding/sql/ddl.sql      |  2 ++
 src/backend/catalog/system_views.sql   | 11 ++++++++++-
 src/backend/replication/repl_gram.y    | 13 +++++++++++--
 src/backend/replication/repl_scanner.l |  1 +
 src/backend/replication/slotfuncs.c    |  7 +++++--
 src/backend/replication/walsender.c    |  4 ++--
 src/include/catalog/pg_proc.h          |  4 ++--
 src/include/nodes/replnodes.h          |  1 +
 src/include/replication/slot.h         |  1 +
 10 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out
index 77719e8..6353930 100644
--- a/contrib/test_decoding/expected/ddl.out
+++ b/contrib/test_decoding/expected/ddl.out
@@ -9,6 +9,9 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
 -- fail because of an already existing slot
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 ERROR:  replication slot "regression_slot" already exists
+-- fail because a failover slot can't replace a normal slot on the master
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding', true);
+ERROR:  replication slot "regression_slot" already exists
 -- fail because of an invalid name
 SELECT 'init' FROM pg_create_logical_replication_slot('Invalid Name', 'test_decoding');
 ERROR:  replication slot name "Invalid Name" contains invalid character
diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql
index ad928ad..5a94747 100644
--- a/contrib/test_decoding/sql/ddl.sql
+++ b/contrib/test_decoding/sql/ddl.sql
@@ -4,6 +4,8 @@ SET synchronous_commit = on;
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 -- fail because of an already existing slot
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+-- fail because a failover slot can't replace a normal slot on the master
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding', true);
 -- fail because of an invalid name
 SELECT 'init' FROM pg_create_logical_replication_slot('Invalid Name', 'test_decoding');
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fef67bd..593b3e9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -968,12 +968,21 @@ AS 'pg_logical_slot_peek_binary_changes';
 
 CREATE OR REPLACE FUNCTION pg_create_physical_replication_slot(
     IN slot_name name, IN immediately_reserve boolean DEFAULT false,
-    OUT slot_name name, OUT xlog_position pg_lsn)
+    IN failover boolean DEFAULT false, OUT slot_name name,
+    OUT xlog_position pg_lsn)
 RETURNS RECORD
 LANGUAGE INTERNAL
 STRICT VOLATILE
 AS 'pg_create_physical_replication_slot';
 
+CREATE OR REPLACE FUNCTION pg_create_logical_replication_slot(
+    IN slot_name name, IN plugin name, IN failover boolean DEFAULT false,
+    OUT slot_name text, OUT xlog_position pg_lsn)
+RETURNS RECORD
+LANGUAGE INTERNAL
+STRICT VOLATILE
+AS 'pg_create_logical_replication_slot';
+
 CREATE OR REPLACE FUNCTION
   make_interval(years int4 DEFAULT 0, months int4 DEFAULT 0, weeks int4 DEFAULT 0,
                 days int4 DEFAULT 0, hours int4 DEFAULT 0, mins int4 DEFAULT 0,
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index d93db88..1574f24 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -77,6 +77,7 @@ Node *replication_parse_result;
 %token K_LOGICAL
 %token K_SLOT
 %token K_RESERVE_WAL
+%token K_FAILOVER
 
 %type <node>	command
 %type <node>	base_backup start_replication start_logical_replication
@@ -90,6 +91,7 @@ Node *replication_parse_result;
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot
 %type <boolval>	opt_reserve_wal
+%type <boolval> opt_failover
 
 %%
 
@@ -184,23 +186,25 @@ base_backup_opt:
 
 create_replication_slot:
 			/* CREATE_REPLICATION_SLOT slot PHYSICAL RESERVE_WAL */
-			K_CREATE_REPLICATION_SLOT IDENT K_PHYSICAL opt_reserve_wal
+			K_CREATE_REPLICATION_SLOT IDENT K_PHYSICAL opt_reserve_wal opt_failover
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_PHYSICAL;
 					cmd->slotname = $2;
 					cmd->reserve_wal = $4;
+					cmd->failover = $5;
 					$$ = (Node *) cmd;
 				}
 			/* CREATE_REPLICATION_SLOT slot LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT K_LOGICAL IDENT
+			| K_CREATE_REPLICATION_SLOT IDENT K_LOGICAL IDENT opt_failover
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->plugin = $4;
+					cmd->failover = $5;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -276,6 +280,11 @@ opt_reserve_wal:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_failover:
+			K_FAILOVER						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index f83ec53..a1d9f10 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -98,6 +98,7 @@ PHYSICAL			{ return K_PHYSICAL; }
 RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
+FAILOVER			{ return K_FAILOVER; }
 
 ","				{ return ','; }
 ";"				{ return ';'; }
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index f430714..a2dfc40 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -18,6 +18,7 @@
 
 #include "access/htup_details.h"
 #include "replication/slot.h"
+#include "replication/slot_xlog.h"
 #include "replication/logical.h"
 #include "replication/logicalfuncs.h"
 #include "utils/builtins.h"
@@ -41,6 +42,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 {
 	Name		name = PG_GETARG_NAME(0);
 	bool 		immediately_reserve = PG_GETARG_BOOL(1);
+	bool		failover = PG_GETARG_BOOL(2);
 	Datum		values[2];
 	bool		nulls[2];
 	TupleDesc	tupdesc;
@@ -57,7 +59,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 	CheckSlotRequirements();
 
 	/* acquire replication slot, this will check for conflicting names */
-	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT, false);
+	ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT, failover);
 
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	nulls[0] = false;
@@ -96,6 +98,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 {
 	Name		name = PG_GETARG_NAME(0);
 	Name		plugin = PG_GETARG_NAME(1);
+	bool		failover = PG_GETARG_BOOL(2);
 
 	LogicalDecodingContext *ctx = NULL;
 
@@ -120,7 +123,7 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	 * errors during initialization because it'll get dropped if this
 	 * transaction fails. We'll make it persistent at the end.
 	 */
-	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL, false);
+	ReplicationSlotCreate(NameStr(*name), true, RS_EPHEMERAL, failover);
 
 	/*
 	 * Create logical decoding context, to build the initial snapshot.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 53de576..fb7336c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -794,7 +794,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	if (cmd->kind == REPLICATION_KIND_PHYSICAL)
 	{
-		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT, false);
+		ReplicationSlotCreate(cmd->slotname, false, RS_PERSISTENT, cmd->failover);
 	}
 	else
 	{
@@ -805,7 +805,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 * handle errors during initialization because it'll get dropped if
 		 * this transaction fails. We'll make it persistent at the end.
 		 */
-		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL, false);
+		ReplicationSlotCreate(cmd->slotname, true, RS_EPHEMERAL, cmd->failover);
 	}
 
 	initStringInfo(&output_message);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ceb8129..0b3d7ed 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5095,13 +5095,13 @@ DATA(insert OID = 3473 (  spg_range_quad_leaf_consistent	PGNSP PGUID 12 1 0 0 0
 DESCR("SP-GiST support for quad tree over range");
 
 /* replication slots */
-DATA(insert OID = 3779 (  pg_create_physical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 2 0 2249 "19 16" "{19,16,19,3220}" "{i,i,o,o}" "{slot_name,immediately_reserve,slot_name,xlog_position}" _null_ _null_ pg_create_physical_replication_slot _null_ _null_ _null_ ));
+DATA(insert OID = 3779 (  pg_create_physical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 16 16" "{19,16,16,19,3220}" "{i,i,i,o,o}" "{slot_name,immediately_reserve,failover,slot_name,xlog_position}" _null_ _null_ pg_create_physical_replication_slot _null_ _null_ _null_ ));
 DESCR("create a physical replication slot");
 DATA(insert OID = 3780 (  pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 0 2278 "19" _null_ _null_ _null_ _null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ ));
 DESCR("drop a replication slot");
 DATA(insert OID = 3781 (  pg_get_replication_slots	PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,23,28,28,3220,3220}" "{o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null_ _null_ pg_get_replication_slots _null_ _null_ _null_ ));
 DESCR("information about replication slots currently in use");
-DATA(insert OID = 3786 (  pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 2 0 2249 "19 19" "{19,19,25,3220}" "{i,i,o,o}" "{slot_name,plugin,slot_name,xlog_position}" _null_ _null_ pg_create_logical_replication_slot _null_ _null_ _null_ ));
+DATA(insert OID = 3786 (  pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 19 16" "{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,failover,slot_name,xlog_position}" _null_ _null_ pg_create_logical_replication_slot _null_ _null_ _null_ ));
 DESCR("set up a logical replication slot");
 DATA(insert OID = 3782 (  pg_logical_slot_get_changes PGNSP PGUID 12 1000 1000 25 0 f f f f f t v u 4 0 2249 "19 3220 23 1009" "{19,3220,23,1009,3220,28,25}" "{i,i,i,v,o,o,o}" "{slot_name,upto_lsn,upto_nchanges,options,location,xid,data}" _null_ _null_ pg_logical_slot_get_changes _null_ _null_ _null_ ));
 DESCR("get changes from replication slot");
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index d2f1edb..a8fa9d5 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		reserve_wal;
+	bool		failover;
 } CreateReplicationSlotCmd;
 
 
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index cdcbd37..9e23a29 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -4,6 +4,7 @@
  *
  * Copyright (c) 2012-2016, PostgreSQL Global Development Group
  *
+ * src/include/replication/slot.h
  *-------------------------------------------------------------------------
  */
 #ifndef SLOT_H
-- 
2.1.0

0005-Document-failover-slots.patchtext/x-patch; charset=US-ASCII; name=0005-Document-failover-slots.patchDownload
From e8418b58537be1f696bbdb24757f8c25dedc1826 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 23 Feb 2016 15:31:13 +0800
Subject: [PATCH 5/7] Document failover slots

---
 doc/src/sgml/func.sgml              | 15 +++++++++-----
 doc/src/sgml/high-availability.sgml | 41 +++++++++++++++++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml   |  2 +-
 doc/src/sgml/protocol.sgml          | 19 ++++++++++++++++-
 4 files changed, 70 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 000489d..3f4d35b 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17892,7 +17892,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         <indexterm>
          <primary>pg_create_physical_replication_slot</primary>
         </indexterm>
-        <literal><function>pg_create_physical_replication_slot(<parameter>slot_name</parameter> <type>name</type> <optional>, <parameter>immediately_reserve</> <type>boolean</> </optional>)</function></literal>
+        <literal><function>pg_create_physical_replication_slot(<parameter>slot_name</parameter> <type>name</type>, <optional><parameter>immediately_reserve</> <type>boolean</></optional>, <optional><parameter>failover</> <type>boolean</></optional>)</function></literal>
        </entry>
        <entry>
         (<parameter>slot_name</parameter> <type>name</type>, <parameter>xlog_position</parameter> <type>pg_lsn</type>)
@@ -17903,7 +17903,10 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         when <literal>true</>, specifies that the <acronym>LSN</> for this
         replication slot be reserved immediately; otherwise
         the <acronym>LSN</> is reserved on first connection from a streaming
-        replication client. Streaming changes from a physical slot is only
+        replication client. If <literal>failover</literal> is <literal>true</literal>
+        then the slot is created as a failover slot; see <xref
+        linkend="streaming-replication-slots-failover">.
+        Streaming changes from a physical slot is only
         possible with the streaming-replication protocol &mdash;
         see <xref linkend="protocol-replication">. This function corresponds
         to the replication protocol command <literal>CREATE_REPLICATION_SLOT
@@ -17932,7 +17935,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         <indexterm>
          <primary>pg_create_logical_replication_slot</primary>
         </indexterm>
-        <literal><function>pg_create_logical_replication_slot(<parameter>slot_name</parameter> <type>name</type>, <parameter>plugin</parameter> <type>name</type>)</function></literal>
+        <literal><function>pg_create_logical_replication_slot(<parameter>slot_name</parameter> <type>name</type>, <parameter>plugin</parameter> <type>name</type>, <optional><parameter>failover</> <type>boolean</></optional>)</function></literal>
        </entry>
        <entry>
         (<parameter>slot_name</parameter> <type>name</type>, <parameter>xlog_position</parameter> <type>pg_lsn</type>)
@@ -17940,8 +17943,10 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
        <entry>
         Creates a new logical (decoding) replication slot named
         <parameter>slot_name</parameter> using the output plugin
-        <parameter>plugin</parameter>.  A call to this function has the same
-        effect as the replication protocol command
+        <parameter>plugin</parameter>. If <literal>failover</literal>
+        is <literal>true</literal> the slot is created as a failover
+        slot; see <xref linkend="streaming-replication-slots-failover">. A call to
+        this function has the same effect as the replication protocol command
         <literal>CREATE_REPLICATION_SLOT ... LOGICAL</literal>.
        </entry>
       </row>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 6cb690c..4b75175 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -949,6 +949,47 @@ primary_slot_name = 'node_a_slot'
 </programlisting>
     </para>
    </sect3>
+
+   <sect3 id="streaming-replication-slots-failover" xreflabel="Failover slots">
+     <title>Failover slots</title>
+
+     <para>
+      Normally a replication slot is not preserved across backup and restore
+      (such as by <application>pg_basebackup</application>) and is not
+      replicated to standbys. Slots are <emphasis>automatically
+      dropped</emphasis> when starting up as a streaming replica or in archive
+      recovery (PITR) mode.
+     </para>
+
+     <para>
+      To make it possible for an application to consistently follow a
+      failover when a replica is promoted to a new master, a slot may be
+      created as a <emphasis>failover slot</emphasis>. A failover slot may
+      only be created, replayed from, or dropped on a master server. Changes
+      to the slot are written to WAL and replicated to standbys. When a
+      standby is promoted, applications may connect to the slot on the
+      standby and resume replay from it at a consistent point, as if it were
+      the original master. Failover slots may not be used to replay from a
+      standby before promotion.
+     </para>
+
+     <para>
+      Non-failover slots may be created on and used from a replica. This is
+      currently limited to physical slots as logical decoding is not supported
+      on replica server.
+     </para>
+
+     <para>
+      When a failover slot created on the master has the same name as a
+      non-failover slot on a replica server, the non-failover slot will be
+      automatically dropped. Any client currently connected will be
+      disconnected with an error indicating a conflict with recovery. It
+      is strongly recommended that you avoid creating failover slots with
+      the same name as slots on replicas.
+     </para>
+
+   </sect3>
+
   </sect2>
 
   <sect2 id="cascading-replication">
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 046f009..c038669 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -288,7 +288,7 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
     The commands
     <itemizedlist>
      <listitem>
-      <para><literal>CREATE_REPLICATION_SLOT <replaceable>slot_name</replaceable> LOGICAL <replaceable>output_plugin</replaceable></literal></para>
+      <para><literal>CREATE_REPLICATION_SLOT <replaceable>slot_name</replaceable> LOGICAL <replaceable>output_plugin</replaceable> <optional>FAILOVER</optional></literal></para>
      </listitem>
 
      <listitem>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 522128e..33b6830 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1434,7 +1434,7 @@ The commands accepted in walsender mode are:
   </varlistentry>
 
   <varlistentry>
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> { <literal>PHYSICAL</> [ <literal>RESERVE_WAL</> ] | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</> { <literal>PHYSICAL</> <optional><literal>RESERVE_WAL</></> | <literal>LOGICAL</> <replaceable class="parameter">output_plugin</> } <optional><literal>FAILOVER</></>
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1474,6 +1474,17 @@ The commands accepted in walsender mode are:
         </para>
        </listitem>
       </varlistentry>
+
+      <varlistentry>
+       <term><literal>FAILOVER</></term>
+       <listitem>
+        <para>
+         Create this slot as a <link linkend="streaming-replication-slots-failover">
+         failover slot</link>.
+        </para>
+       </listitem>
+      </varlistentry>
+
      </variablelist>
     </listitem>
   </varlistentry>
@@ -1829,6 +1840,12 @@ The commands accepted in walsender mode are:
       to process the output for streaming.
      </para>
 
+     <para>
+      Logical replication automatically follows timeline switches. It is
+      neither necessary nor possible to supply a <literal>TIMELINE</literal>
+      option as in physical replication.
+     </para>
+
      <variablelist>
       <varlistentry>
        <term><literal>SLOT</literal> <replaceable class="parameter">slot_name</></term>
-- 
2.1.0

0006-Add-failover-to-pg_replication_slots.patchtext/x-patch; charset=US-ASCII; name=0006-Add-failover-to-pg_replication_slots.patchDownload
From 5def8af4f381f8374ca209bd60e9311a6c876025 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 23 Feb 2016 15:55:01 +0800
Subject: [PATCH 6/7] Add 'failover' to pg_replication_slots

---
 contrib/test_decoding/expected/ddl.out | 38 ++++++++++++++++++++++++++++------
 contrib/test_decoding/sql/ddl.sql      | 15 ++++++++++++--
 doc/src/sgml/catalogs.sgml             | 10 +++++++++
 src/backend/catalog/system_views.sql   |  1 +
 src/backend/replication/slotfuncs.c    |  6 +++++-
 src/include/catalog/pg_proc.h          |  2 +-
 src/test/regress/expected/rules.out    |  3 ++-
 7 files changed, 64 insertions(+), 11 deletions(-)

diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out
index 6353930..da713df 100644
--- a/contrib/test_decoding/expected/ddl.out
+++ b/contrib/test_decoding/expected/ddl.out
@@ -61,11 +61,37 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
 SELECT slot_name, plugin, slot_type, active,
     NOT catalog_xmin IS NULL AS catalog_xmin_set,
     xmin IS NULl  AS data_xmin_not_set,
-    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
 FROM pg_replication_slots;
-    slot_name    |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal 
------------------+---------------+-----------+--------+------------------+-------------------+----------
- regression_slot | test_decoding | logical   | f      | t                | t                 | t
+    slot_name    |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal | failover 
+-----------------+---------------+-----------+--------+------------------+-------------------+----------+----------
+ regression_slot | test_decoding | logical   | f      | t                | t                 | t        | f
+(1 row)
+
+/* same for a failover slot */
+SELECT 'init' FROM pg_create_logical_replication_slot('failover_slot', 'test_decoding', true);
+ ?column? 
+----------
+ init
+(1 row)
+
+SELECT slot_name, plugin, slot_type, active,
+    NOT catalog_xmin IS NULL AS catalog_xmin_set,
+    xmin IS NULl  AS data_xmin_not_set,
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
+FROM pg_replication_slots
+WHERE slot_name = 'failover_slot';
+   slot_name   |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal | failover 
+---------------+---------------+-----------+--------+------------------+-------------------+----------+----------
+ failover_slot | test_decoding | logical   | f      | t                | t                 | t        | t
+(1 row)
+
+SELECT pg_drop_replication_slot('failover_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
 (1 row)
 
 /*
@@ -691,7 +717,7 @@ SELECT pg_drop_replication_slot('regression_slot');
 
 /* check that the slot is gone */
 SELECT * FROM pg_replication_slots;
- slot_name | plugin | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
+ slot_name | plugin | slot_type | datoid | database | active | active_pid | failover | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
+-----------+--------+-----------+--------+----------+--------+------------+----------+------+--------------+-------------+---------------------
 (0 rows)
 
diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql
index 5a94747..dd0e3d6 100644
--- a/contrib/test_decoding/sql/ddl.sql
+++ b/contrib/test_decoding/sql/ddl.sql
@@ -24,16 +24,27 @@ SELECT 'init' FROM pg_create_physical_replication_slot('repl');
 SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 SELECT pg_drop_replication_slot('repl');
 
-
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 
 /* check whether status function reports us, only reproduceable columns */
 SELECT slot_name, plugin, slot_type, active,
     NOT catalog_xmin IS NULL AS catalog_xmin_set,
     xmin IS NULl  AS data_xmin_not_set,
-    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
 FROM pg_replication_slots;
 
+/* same for a failover slot */
+SELECT 'init' FROM pg_create_logical_replication_slot('failover_slot', 'test_decoding', true);
+SELECT slot_name, plugin, slot_type, active,
+    NOT catalog_xmin IS NULL AS catalog_xmin_set,
+    xmin IS NULl  AS data_xmin_not_set,
+    pg_xlog_location_diff(restart_lsn, '0/01000000') > 0 AS some_wal,
+    failover
+FROM pg_replication_slots
+WHERE slot_name = 'failover_slot';
+SELECT pg_drop_replication_slot('failover_slot');
+
 /*
  * Check that changes are handled correctly when interleaved with ddl
  */
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 951f59b..0a3af1f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -5377,6 +5377,16 @@
      </row>
 
      <row>
+      <entry><structfield>failover</structfield></entry>
+      <entry><type>boolean</type></entry>
+      <entry></entry>
+      <entry>
+       True if this slot is a failover slot; see
+       <xref linkend="streaming-replication-slots-failover">.
+      </entry>
+     </row>
+
+     <row>
       <entry><structfield>xmin</structfield></entry>
       <entry><type>xid</type></entry>
       <entry></entry>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 593b3e9..7dc3cab 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -705,6 +705,7 @@ CREATE VIEW pg_replication_slots AS
             D.datname AS database,
             L.active,
             L.active_pid,
+            L.failover,
             L.xmin,
             L.catalog_xmin,
             L.restart_lsn,
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index a2dfc40..abc450d 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -177,7 +177,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS)
 Datum
 pg_get_replication_slots(PG_FUNCTION_ARGS)
 {
-#define PG_GET_REPLICATION_SLOTS_COLS 10
+#define PG_GET_REPLICATION_SLOTS_COLS 11
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -227,6 +227,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		XLogRecPtr	restart_lsn;
 		XLogRecPtr	confirmed_flush_lsn;
 		pid_t		active_pid;
+		bool		failover;
 		Oid			database;
 		NameData	slot_name;
 		NameData	plugin;
@@ -249,6 +250,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 			namecpy(&plugin, &slot->data.plugin);
 
 			active_pid = slot->active_pid;
+			failover = slot->data.failover;
 		}
 		SpinLockRelease(&slot->mutex);
 
@@ -279,6 +281,8 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		else
 			nulls[i++] = true;
 
+		values[i++] = BoolGetDatum(failover);
+
 		if (xmin != InvalidTransactionId)
 			values[i++] = TransactionIdGetDatum(xmin);
 		else
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 0b3d7ed..44bf51c 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5099,7 +5099,7 @@ DATA(insert OID = 3779 (  pg_create_physical_replication_slot PGNSP PGUID 12 1 0
 DESCR("create a physical replication slot");
 DATA(insert OID = 3780 (  pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 0 2278 "19" _null_ _null_ _null_ _null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ ));
 DESCR("drop a replication slot");
-DATA(insert OID = 3781 (  pg_get_replication_slots	PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,23,28,28,3220,3220}" "{o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null_ _null_ pg_get_replication_slots _null_ _null_ _null_ ));
+DATA(insert OID = 3781 (  pg_get_replication_slots	PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,23,16,28,28,3220,3220}" "{o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,active,active_pid,failover,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null_ _null_ pg_get_replication_slots _null_ _null_ _null_ ));
 DESCR("information about replication slots currently in use");
 DATA(insert OID = 3786 (  pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 19 16" "{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,failover,slot_name,xlog_position}" _null_ _null_ pg_create_logical_replication_slot _null_ _null_ _null_ ));
 DESCR("set up a logical replication slot");
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 79f9b23..8dbbced 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1417,11 +1417,12 @@ pg_replication_slots| SELECT l.slot_name,
     d.datname AS database,
     l.active,
     l.active_pid,
+    l.failover,
     l.xmin,
     l.catalog_xmin,
     l.restart_lsn,
     l.confirmed_flush_lsn
-   FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, active, active_pid, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn)
+   FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, active, active_pid, failover, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn)
      LEFT JOIN pg_database d ON ((l.datoid = d.oid)));
 pg_roles| SELECT pg_authid.rolname,
     pg_authid.rolsuper,
-- 
2.1.0

Attachment: 0007-Introduce-TAP-recovery-tests-for-failover-slots.patch (text/x-patch)
From 42f2103ebc11c4fa01c3efcbefec6c45cdd67093 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Thu, 10 Mar 2016 10:50:59 +0800
Subject: [PATCH 7/7] Introduce TAP recovery tests for failover slots

---
 src/test/recovery/t/007_failover_slots.pl | 644 ++++++++++++++++++++++++++++++
 1 file changed, 644 insertions(+)
 create mode 100644 src/test/recovery/t/007_failover_slots.pl

diff --git a/src/test/recovery/t/007_failover_slots.pl b/src/test/recovery/t/007_failover_slots.pl
new file mode 100644
index 0000000..b8ebda0
--- /dev/null
+++ b/src/test/recovery/t/007_failover_slots.pl
@@ -0,0 +1,644 @@
+#
+# Test failover slots
+#
+use strict;
+use warnings;
+use bigint;
+use PostgresNode;
+use TestLib;
+use Test::More;
+use RecursiveCopy;
+use File::Copy;
+use File::Basename qw(basename);
+use List::Util qw();
+use Data::Dumper;
+use IPC::Run qw();
+
+
+use Carp 'verbose';
+$SIG{ __DIE__ } = sub { Carp::confess( @_ ) };
+
+my $verbose = 0;
+
+sub lsn_to_bigint
+{
+	my ($lsn) = @_;
+	my ($high, $low) = split("/",$lsn);
+	return hex($high) * 2**32 + hex($low);
+}
+
+sub get_slot_info
+{
+	my ($node, $slot_name) = @_;
+
+	my $esc_slot_name = $slot_name;
+	$esc_slot_name =~ s/'/''/g;
+	my @selectlist = ('slot_name', 'plugin', 'slot_type', 'database', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn', 'confirmed_flush_lsn', 'failover');
+	my $row = $node->safe_psql('postgres', "SELECT " . join(', ', @selectlist) . " FROM pg_replication_slots WHERE slot_name = '$esc_slot_name';");
+	chomp $row;
+	my @fields = split('\|', $row, -1);
+	if (scalar @fields != scalar @selectlist)
+	{
+		diag "Invalid row is: '$row'";
+		die "Select-list '@selectlist'(".scalar(@selectlist).") didn't match length of result-list '@fields'(".scalar(@fields).")";
+	}
+	my %slotinfo;
+	for (my $i = 0; $i < scalar @selectlist; $i++)
+	{
+		$slotinfo{$selectlist[$i]} = $fields[$i];
+	}
+	return \%slotinfo;
+}
+
+sub diag_slotinfo
+{
+	my ($info, $msg) = @_;
+	return unless $verbose;
+	diag "slot " . $info->{slot_name} . ": " . Dumper($info);
+}
+
+sub wait_for_catchup
+{
+	my ($node_master, $node_replica) = @_;
+
+	my $master_lsn = $node_master->safe_psql('postgres', 'SELECT pg_current_xlog_insert_location()');
+	diag "waiting for " . $node_replica->name . " to catch up to $master_lsn on " . $node_master->name if $verbose;
+	my $ret = $node_replica->poll_query_until('postgres',
+		"SELECT pg_last_xlog_replay_location() >= '$master_lsn'::pg_lsn;");
+	BAIL_OUT('replica failed to catch up') unless $ret;
+	my $replica_lsn = $node_replica->safe_psql('postgres', 'SELECT pg_last_xlog_replay_location()');
+	diag "Replica is caught up to $replica_lsn, past required LSN $master_lsn" if $verbose;
+}
+
+sub read_slot_updates_from_xlog
+{
+	my ($node, $timeline) = @_;
+	my ($stdout, $stderr) = ('', '');
+	# Look at master xlogs and examine sequence advances
+	my $wal_pattern = sprintf("%s/pg_xlog/%08X" . ("?" x 16), $node->data_dir, $timeline);
+	my @wal = glob $wal_pattern;
+	my $firstwal = List::Util::minstr(@wal);
+	my $lastwal = basename(List::Util::maxstr(@wal));
+	diag "decoding xlog on " . $node->name . " from $firstwal to $lastwal" if $verbose;
+	IPC::Run::run ['pg_xlogdump', $firstwal, $lastwal], '>', \$stdout, '2>', \$stderr;
+	like($stderr, qr/invalid record length at [0-9A-F]+\/[0-9A-F]+: wanted 24, got 0/,
+		'pg_xlogdump exits with expected error');
+	my @slots = grep(/ReplicationSlot/, split(/\n/, $stdout));
+
+	# Parse the dumped xlog data
+	my @slot_updates = ();
+	for my $slot (@slots) {
+		if (my @matches = ($slot =~ /lsn: ([[:xdigit:]]{1,8}\/[[:xdigit:]]{1,8}), prev [[:xdigit:]]{1,8}\/[[:xdigit:]]{1,8}, desc: UPDATE of slot (\w+) with restart ([[:xdigit:]]{1,8}\/[[:xdigit:]]{1,8}) and xid ([[:digit:]]+) confirmed to ([[:xdigit:]]{1,8}\/[[:xdigit:]]{1,8})/))
+		{
+			my %slot_update = (
+				action => 'UPDATE',
+				log_lsn => $1, slot_name => $2, restart_lsn => $3,
+				xid => $4, confirm_lsn => $5
+				);
+			diag "Replication slot create/advance: $slot_update{slot_name} advanced to $slot_update{confirm_lsn} with restart $slot_update{restart_lsn} and $slot_update{xid} in xlog entry $slot_update{log_lsn}" if $verbose;
+			push @slot_updates, \%slot_update;
+		}
+		elsif ($slot =~ /DELETE/)
+		{
+			diag "Replication slot delete: $slot" if $verbose;
+		}
+		else
+		{
+			die "Slot xlog entry didn't match pattern: $slot";
+		}
+	}
+	return \@slot_updates;
+}
+
+sub check_slot_wal_update
+{
+	my ($entry, $slotname, %params) = @_;
+
+	ok(defined($entry), "xlog entry exists for slot $slotname");
+	SKIP: {
+		skip 'Expected xlog entry was undef' unless defined($entry);
+		my %entry = %{$entry}; undef $entry;
+		diag "Examining decoded slot update xlog entry: " . Dumper(\%entry) if $verbose;
+		is($entry{action}, 'UPDATE', "$slotname: action is an update");
+		is($entry{slot_name}, $slotname, "$slotname: action affects slot " . $slotname);
+
+		cmp_ok(lsn_to_bigint($entry{restart_lsn}), "le",
+		       lsn_to_bigint($entry{log_lsn}),
+		       "$slotname: restart_lsn is no greater than LSN when logged");
+
+		cmp_ok(lsn_to_bigint($entry{confirm_lsn}), "le",
+		       lsn_to_bigint($entry{log_lsn}),
+		       "$slotname: confirm_lsn is no greater than LSN when logged");
+
+		cmp_ok(lsn_to_bigint($entry{confirm_lsn}), "ge",
+			lsn_to_bigint($entry{restart_lsn}),
+			"$slotname: confirm_lsn equal to or ahead of restart_lsn")
+		      if $entry{confirm_lsn} && $entry{confirm_lsn} ne '0/0';
+
+		cmp_ok(lsn_to_bigint($entry{restart_lsn}), "le",
+			lsn_to_bigint($params{expect_max_restart_lsn}),
+			"$slotname: restart_lsn is at or before expected")
+			if ($params{expect_max_restart_lsn});
+
+		cmp_ok(lsn_to_bigint($entry{restart_lsn}), "ge",
+			lsn_to_bigint($params{expect_min_restart_lsn}),
+			"$slotname: restart_lsn is at or after expected")
+			if ($params{expect_min_restart_lsn});
+
+		cmp_ok(lsn_to_bigint($entry{confirm_lsn}), "le",
+			lsn_to_bigint($params{expect_max_confirm_lsn}),
+			"$slotname: confirm_lsn is at or before expected")
+			if ($params{expect_max_confirm_lsn});
+
+		cmp_ok(lsn_to_bigint($entry{confirm_lsn}), "ge",
+			lsn_to_bigint($params{expect_min_confirm_lsn}),
+			"$slotname: confirm_lsn is at or after expected")
+			if ($params{expect_min_confirm_lsn});
+	}
+}
+
+sub test_read_from_slot
+{
+	my ($node, $slot, $expected) = @_;
+	my $slot_quoted = $slot;
+	$slot_quoted =~ s/'/''/g;
+	my ($ret, $stdout, $stderr) = $node->psql('postgres',
+		"SELECT data FROM pg_logical_slot_peek_changes('$slot_quoted', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');"
+	);
+	is($ret, 0, "replaying from slot $slot is successful");
+	is($stderr, '', "replay from slot $slot produces no stderr");
+	if (defined($expected)) {
+		is($stdout, $expected, "slot $slot returned expected output");
+	}
+	return $stderr;
+}
+
+sub wait_for_end_of_recovery
+{
+	my ($node) = @_;
+	$node->poll_query_until('postgres',
+		"SELECT NOT pg_is_in_recovery();");
+}
+
+# Launch pg_xlogdump as a background proc and return the IPC::Run handle for it
+# as well as the proc's stdout and stderr scalar refs as well as the path to
+# where the xlogs are written.
+sub start_pg_receivexlog
+{
+  my ($node, $slotname) = @_;
+  my ($stdout, $stderr);
+
+  my $outdir = $node->basedir . '/xl_' . $slotname;
+  mkdir($outdir);
+
+  my @cmd = ("pg_receivexlog", "--verbose", "-S", $slotname, "-D", $outdir, "--no-loop", "--dbname", $node->connstr);
+  diag "Running '@cmd'" if $verbose;
+
+  my $proc = IPC::Run::start \@cmd, '>', \$stdout, '2>', \$stderr;
+
+  die $! unless defined($proc);
+
+  return ($proc, \$stdout, \$stderr, $outdir);
+}
+
+sub test_phys_replay
+{
+  my ($node, $slotname, $start_tli) = @_;
+  my ($recvxlog, $stdout, $stderr, $outdir) = start_pg_receivexlog($node, $slotname);
+  # pg_receivexlog doesn't give us a --nowait option so we have to just wait a
+  # bit then kill it.
+  sleep(1);
+  $recvxlog->signal("TERM");
+  sleep(1);
+  $recvxlog->finish;
+  # FIXME: Not portable, we should use IPC::Signal but that's in CPAN because
+  # apparently Perl doesn't have a signo/signame mapping built-in. WTF...
+  is($recvxlog->full_result, "15", 'pg_recvlog exited due to SIGTERM');
+  chomp $$stderr;
+  my $expected_stderr_re = "^pg_receivexlog: starting log streaming at ([[:xdigit:]]{1,8})/([[:xdigit:]]{1,8}) \\(timeline ($start_tli)\\)";
+  like($$stderr, "/$expected_stderr_re/", "reported start location to stderr");
+  if ($$stderr =~ $expected_stderr_re)
+  {
+    my ($cap_lsn_high, $cap_lsn_low, $cap_tli) = ($1, $2, $3);
+    diag "pg_xlogdump streamed xlog from node " . $node->name . " starting at $cap_lsn_high/$cap_lsn_low on timeline $cap_tli" if $verbose;
+    is($cap_tli, $start_tli, 'replay started on expected timeline') if ($start_tli);
+  }
+  is($$stdout, '', "no stdout");
+  my @xlogs = glob $outdir . "/*";
+  cmp_ok(scalar(@xlogs), "ge", 1, "Received at least one segment from $slotname");
+}
+
+
+my ($stdout, $stderr, $ret, $slotinfo, $outdir, $proc);
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', "wal_level = 'logical'\n");
+$node_master->append_conf('postgresql.conf', "max_replication_slots = 8\n");
+$node_master->append_conf('postgresql.conf', "max_wal_senders = 8\n");
+#$node_master->append_conf('postgresql.conf', "log_min_messages = 'debug2'\n");
+$node_master->dump_info;
+$node_master->start;
+
+my $master_beforecreate_bb_lsn = $node_master->safe_psql('postgres',
+	"SELECT pg_current_xlog_insert_location()");
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('bb_failover', 'test_decoding', true);"
+);
+my $bb_beforeconsume_si = get_slot_info($node_master, 'bb_failover');
+diag_slotinfo $bb_beforeconsume_si, 'bb_beforeconsume';
+
+# Create non-failover slot to make sure it isn't replicated
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('bb', 'test_decoding');"
+);
+
+# Failover slots work for physical slots too
+$node_master->safe_psql('postgres',
+"SELECT pg_create_physical_replication_slot('bb_phys_failover', false, true);");
+$node_master->safe_psql('postgres',
+"SELECT pg_create_physical_replication_slot('bb_phys');");
+
+my $bb_phys_beforeconsume_si = get_slot_info($node_master, 'bb_phys_failover');
+diag_slotinfo $bb_phys_beforeconsume_si, 'bb_phys_beforeconsume';
+
+$node_master->safe_psql('postgres', "CREATE TABLE decoding(blah text);");
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('consumed');");
+($ret, $stdout, $stderr) = $node_master->psql('postgres',
+	"SELECT data FROM pg_logical_slot_get_changes('bb_failover', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+is($ret, 0, 'replaying from bb_failover on master is successful');
+is( $stdout, q(BEGIN
+table public.decoding: INSERT: blah[text]:'consumed'
+COMMIT), 'decoded expected data from slot bb_failover on master');
+is($stderr, '', 'replay from slot bb_failover produces no stderr');
+
+my $bb_afterconsume_si = get_slot_info($node_master, 'bb_failover');
+diag_slotinfo $bb_afterconsume_si, 'bb_afterconsume';
+
+($ret, $stdout, $stderr) = $node_master->psql('postgres',
+	"SELECT data FROM pg_logical_slot_get_changes('bb_failover', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+is ($ret, 0, 'no error reading empty slot changes after get');
+is ($stdout, '', 'no new changes to read from slot after get');
+
+cmp_ok(lsn_to_bigint($bb_afterconsume_si->{confirmed_flush_lsn}),
+      "gt",
+      lsn_to_bigint($bb_beforeconsume_si->{confirmed_flush_lsn}),
+      "confirm lsn on bb_failover advanced on master after replay");
+
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('beforebb');");
+$node_master->safe_psql('postgres', 'CHECKPOINT;');
+
+my $backup_name = 'b1';
+$node_master->backup_fs_hot($backup_name);
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+$node_replica->start;
+
+my $master_beforecreate_ab_lsn = $node_master->safe_psql('postgres',
+	"SELECT pg_current_xlog_insert_location()");
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('ab_failover', 'test_decoding', true);"
+);
+
+my $ab_beforeconsume_si = get_slot_info($node_master, 'ab_failover');
+diag_slotinfo $ab_beforeconsume_si, 'ab_beforeconsume';
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('ab', 'test_decoding');"
+);
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_physical_replication_slot('ab_phys_failover', false, true);"
+);
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_physical_replication_slot('ab_phys');"
+);
+
+my $ab_phys_beforeconsume_si = get_slot_info($node_master, 'ab_phys_failover');
+diag_slotinfo $ab_phys_beforeconsume_si, 'ab_phys_beforeconsume';
+
+# We can also create physical slots on replicas if they aren't failover slots
+$node_replica->safe_psql('postgres',
+"SELECT pg_create_physical_replication_slot('onreplica');"
+);
+
+($ret, $stdout, $stderr) = $node_replica->psql('postgres',
+"SELECT pg_create_physical_replication_slot('onreplica', false, true);"
+);
+is($ret, 3, "failed to create failover slot on replica");
+like($stderr, qr/a failover slot may not be created on a replica/, "got expected error creating failover slot on replica");
+
+$node_master->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('afterbb');");
+
+wait_for_catchup($node_master, $node_replica);
+
+# Can't replay from a failover slot on a replica
+($proc, $stdout, $stderr, $outdir) = start_pg_receivexlog($node_replica, 'bb_phys_failover');
+$proc->finish;
+is($proc->result, 1, 'pg_receivexlog exited with error code when attempting replay from failover slot on replica');
+is($$stdout, '', 'no stdout');
+like($$stderr, qr/ERROR:.*replication slot "bb_phys_failover" is reserved for use after failover/, 'pg_receivexlog exited with expected error');
+
+$stdout = $node_master->safe_psql('postgres', 'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
+is($stdout, q(ab
+ab_failover
+ab_phys
+ab_phys_failover
+bb
+bb_failover
+bb_phys
+bb_phys_failover), 'Expected slots exist on master')
+  or BAIL_OUT('Remaining tests meaningless');
+
+
+# Verify that only the failover slots and the physical slot we created
+# directly are present on the replica
+$stdout = $node_replica->safe_psql('postgres', 'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
+is($stdout, q(ab_failover
+ab_phys_failover
+bb_failover
+bb_phys_failover
+onreplica), 'Expected slots exist on replica')
+  or BAIL_OUT('Remaining tests meaningless');
+
+# Make sure we can replay from the physical failover slot on the master
+my $master_beforereplay_bbphys_si = get_slot_info($node_master, 'bb_phys_failover');
+is($master_beforereplay_bbphys_si->{restart_lsn}, '',
+  'restart_lsn on slot bb_phys_failover is empty before replay');
+test_phys_replay($node_master, 'bb_phys_failover', 1);
+my $master_afterreplay_bbphys_si = get_slot_info($node_master, 'bb_phys_failover');
+
+cmp_ok(lsn_to_bigint($master_afterreplay_bbphys_si->{restart_lsn}),
+       "gt",
+       0,
+       "bb_phys_failover restart_lsn advanced after replay");
+
+$node_master->stop('fast');
+
+my $log = TestLib::slurp_file($node_master->logfile);
+unlike($log, '/PANIC:/', 'No PANIC in master logs');
+
+my @slot_updates = @{ read_slot_updates_from_xlog($node_master, 1) };
+
+#
+# Decode the WAL from the master and make sure the expected entries and only the
+# expected entries are present.
+#
+# We want to see two WAL entries, one for each slot. There won't be another entry
+# for the slot advance because right now we don't write out WAL when a slot's confirmed
+# location advances, only when the flush location or xmin advance. The restart lsn
+# and confirmed flush LSN in the slot's WAL record must not be less than the LSN
+# of the master before we created the slot and not greater than the position we saw
+# in pg_replication_slots after slot creation.
+#
+
+# bb_failover created
+check_slot_wal_update($slot_updates[0], 'bb_failover',
+	expect_min_restart_lsn => $master_beforecreate_bb_lsn,
+	expect_min_confirm_lsn => $master_beforecreate_bb_lsn,
+	expect_max_restart_lsn => $bb_beforeconsume_si->{restart_lsn},
+	expect_max_confirm_lsn => $bb_beforeconsume_si->{confirmed_flush_lsn});
+
+# bb_phys_failover created
+check_slot_wal_update($slot_updates[1], 'bb_phys_failover',
+	expect_min_restart_lsn => '0/0',
+	expect_min_confirm_lsn => '0/0',
+	expect_max_restart_lsn => '0/0',
+	expect_max_confirm_lsn => '0/0');
+
+# bb_failover updated after replay. This only happens because we
+# force a checkpoint to flush the dirtied but not written-out
+# slot.
+check_slot_wal_update($slot_updates[2], 'bb_failover',
+	expect_min_restart_lsn => $master_beforecreate_bb_lsn,
+	expect_min_confirm_lsn => $master_beforecreate_bb_lsn,
+	expect_max_restart_lsn => $bb_afterconsume_si->{restart_lsn},
+	expect_max_confirm_lsn => $bb_afterconsume_si->{confirmed_flush_lsn});
+
+# Creation of ab_failover
+check_slot_wal_update($slot_updates[3], 'ab_failover',
+	expect_min_restart_lsn => $master_beforecreate_ab_lsn,
+	expect_min_confirm_lsn => $master_beforecreate_ab_lsn,
+	expect_max_restart_lsn => $ab_beforeconsume_si->{restart_lsn},
+	expect_max_confirm_lsn => $ab_beforeconsume_si->{confirmed_flush_lsn});
+
+# Creation of ab_phys_failover
+check_slot_wal_update($slot_updates[4], 'ab_phys_failover',
+	expect_min_restart_lsn => '0/0',
+	expect_min_confirm_lsn => '0/0',
+	expect_max_restart_lsn => '0/0',
+	expect_max_confirm_lsn => '0/0');
+
+# created after we replayed from bb_failover on the master
+check_slot_wal_update($slot_updates[5], 'bb_phys_failover',
+	expect_min_restart_lsn => $master_afterreplay_bbphys_si->{restart_lsn},
+	expect_min_confirm_lsn => '0/0',
+	expect_max_restart_lsn => $master_afterreplay_bbphys_si->{restart_lsn},
+	expect_max_confirm_lsn => '0/0');
+
+# Consuming from a slot does not cause the slot to be written out even on
+# CHECKPOINT. Since nothing else would have dirtied the slot, there should
+# be no more WAL entries for failover slots.
+#
+# The client is expected to keep track of the confirmed LSN and skip replaying
+# data it's already seen.
+ok(!defined($slot_updates[6]), 'No more slot updates');
+
+
+
+# Can replay from physical failover slot on promoted replica
+
+
+$node_replica->promote;
+
+wait_for_end_of_recovery($node_replica);
+
+$node_replica->safe_psql('postgres',
+	"INSERT INTO decoding(blah) VALUES ('after failover');");
+
+my $bb_afterpromote_si = get_slot_info($node_replica, 'bb_failover');
+diag_slotinfo $bb_afterpromote_si, 'bb_afterpromote';
+# Because we forced a checkpoint to flush the slot to disk after replaying from
+# bb_failover it should have the new confirmed flush point on the replica.
+is($bb_afterpromote_si->{confirmed_flush_lsn}, $bb_afterconsume_si->{confirmed_flush_lsn},
+	'slot bb_failover confirmed pos on replica matches master');
+# We haven't replayed much, so the restartpoint probably didn't change, but
+# it should be wherever it was after we replayed anyway.
+is($bb_afterpromote_si->{restart_lsn}, $bb_afterconsume_si->{restart_lsn},
+	'slot bb_failover restart pos on replica matches master');
+
+# We never replayed from the after-basebackup slot on the master so it
+# should be right where it was created.
+my $ab_afterpromote_si = get_slot_info($node_replica, 'ab_failover');
+diag_slotinfo $ab_afterpromote_si, 'ab_afterpromote';
+is($ab_afterpromote_si->{confirmed_flush_lsn}, $ab_beforeconsume_si->{confirmed_flush_lsn},
+	'slot ab_failover confirmed pos is unchanged');
+is($ab_afterpromote_si->{restart_lsn}, $ab_beforeconsume_si->{restart_lsn},
+	'slot ab_failover restart pos is unchanged');
+
+
+
+
+# Can replay from slot ab, following the timeline switch
+test_read_from_slot($node_replica, 'ab_failover', q(BEGIN
+table public.decoding: INSERT: blah[text]:'afterbb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'after failover'
+COMMIT));
+
+# Can replay from slot bb too, and we only see data after
+# what we replayed on the master.
+#
+# Note that if we didn't force a checkpoint on the master then did an unclean
+# shutdown we would expect to see data that we already replayed on the master
+# here.  The confirm lsn wouldn't be flushed on the master and would therefore
+# effectively go backwards on failover.
+#
+# See http://www.postgresql.org/message-id/CAMsr+YGSaTRGqPcx9qx4eOcizWsa27XjKEiPSOtAJE8OfiXT-g@mail.gmail.com
+#
+test_read_from_slot($node_replica, 'bb_failover', q(BEGIN
+table public.decoding: INSERT: blah[text]:'beforebb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'afterbb'
+COMMIT
+BEGIN
+table public.decoding: INSERT: blah[text]:'after failover'
+COMMIT));
+
+
+# Can replay from physical failover slot on promoted replica
+test_phys_replay($node_replica, 'bb_phys_failover', 2);
+
+$node_replica->stop('fast');
+
+$log = TestLib::slurp_file($node_replica->logfile);
+unlike($log, '/PANIC:/', 'No PANIC in replica logs');
+
+# We don't need the standby anymore
+$node_replica->teardown_node();
+
+
+
+# Now make sure slot drop works correctly and replays correctly by restoring
+# a fresh backup of the standby and having it replay the slot drops. We'll
+# also test dropping a physical slot that's currently in-use.
+$node_master->start;
+
+# restore the replica again
+$node_replica = get_new_node('replica2');
+$node_replica->init_from_backup(
+	$node_master, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+$node_replica->start;
+
+
+# start pg_receivexlog from a local slot on the replica. Then create a failover
+# slot with the same name on the master. pg_receivexlog will be automatically
+# killed when we drop the slot it's replaying from and replace it with a failover
+# slot.
+$node_replica->safe_psql('postgres',
+"SELECT pg_create_physical_replication_slot('replace_me', false, false);");
+
+my $si = get_slot_info($node_replica, 'replace_me');
+diag_slotinfo($si);
+is($si->{failover}, 'f', 'created slot replace_me as non-failover');
+
+($proc, $stdout, $stderr, $outdir) = start_pg_receivexlog($node_replica, 'replace_me');
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_physical_replication_slot('replace_me', false, true);");
+
+wait_for_catchup($node_master, $node_replica);
+
+# pg_receivexlog should've died
+$proc->finish;
+is($proc->result, 1, 'pg_receivexlog exited with error code after its slot was dropped');
+is($$stdout, '', 'no stdout');
+like($$stderr, qr/by administrative command/, 'pg_receivexlog exited with admin command');
+
+# The slot is now a failover slot
+$si = get_slot_info($node_replica, 'replace_me');
+is($si->{failover}, 't', 'failover slot successfully replaces local slot');
+
+# OK, make sure slot drops replay correctly
+
+$node_master->safe_psql("postgres", "SELECT pg_drop_replication_slot('bb_failover');");
+$node_master->safe_psql("postgres", "SELECT pg_drop_replication_slot('ab_failover');");
+$node_master->safe_psql("postgres", "SELECT pg_drop_replication_slot('bb_phys_failover');");
+$node_master->safe_psql("postgres", "SELECT pg_drop_replication_slot('ab_phys_failover');");
+$node_master->safe_psql("postgres", "SELECT pg_drop_replication_slot('replace_me');");
+
+wait_for_catchup($node_master, $node_replica);
+
+
+$stdout = $node_replica->safe_psql('postgres', 'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
+is($stdout, '', 'No slots exist on replica')
+  or BAIL_OUT('Remaining tests meaningless');
+
+
+# OK, now we need to test replay of a big enough chunk of data to advance the restart_lsn
+# and make the master do a checkpoint.
+#
+# We create two copies of the slot so we can advance one of them and get the changes
+# checkpointed out, while leaving the other unchanged for replay after failover.
+# This just lets us test two things in one: checkpointing of failover slots and
+# failover with big chunks of data.
+
+$node_master->safe_psql('postgres',
+"SELECT pg_create_logical_replication_slot('big', 'test_decoding', true); SELECT pg_create_logical_replication_slot('big_adv', 'test_decoding', true);"
+);
+
+$node_master->safe_psql('postgres',
+  "CREATE TABLE big_inserts (id serial primary key, padding text);"
+);
+
+$node_master->safe_psql('postgres',
+  "INSERT into big_inserts(padding) SELECT repeat('x', n % 100) FROM generate_series(1, 1000000) n;"
+);
+
+($ret, $stdout, $stderr) = $node_master->psql('postgres',
+	"SELECT data FROM pg_logical_slot_get_changes('big_adv', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+is($ret, 0, 'replaying from slot big_adv on master is successful');
+my $data_replayed_from_master = $stdout;
+is($stderr, '', 'replay from slot big_adv produces no stderr');
+
+wait_for_catchup($node_master, $node_replica);
+$node_master->stop('fast');
+$node_replica->promote;
+wait_for_end_of_recovery($node_replica);
+
+($ret, $stdout, $stderr) = $node_replica->psql('postgres',
+	"SELECT data FROM pg_logical_slot_peek_changes('big', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+is($ret, 0, 'replaying from slot big on replica is successful');
+is($stdout, $data_replayed_from_master, 'Got same data from replica as master');
+is($stderr, '', 'replay from slot big produces no stderr');
+
+$node_replica->stop('fast');
+
+# Make sure there's no crash complaint in the master or replica logs
+$log = TestLib::slurp_file($node_master->logfile);
+unlike($log, '/PANIC:/', 'No PANIC in master logs');
+
+$log = TestLib::slurp_file($node_replica->logfile);
+unlike($log, '/PANIC:/', 'No PANIC in replica logs');
+
+$node_master->teardown_node;
+$node_replica->teardown_node;
+
+done_testing();
-- 
2.1.0

#26Oleksii Kliukin
alexk@hintbits.com
In reply to: Craig Ringer (#25)
Re: WIP: Failover Slots

Hi,

On 17 Mar 2016, at 09:34, Craig Ringer <craig@2ndquadrant.com> wrote:

OK, here's the latest failover slots patch, rebased on top of today's master plus, in order:

- Dirty replication slots when confirm_lsn is changed
(/messages/by-id/CAMsr+YHJ0OyCUG2zbyQpRHxMcjnkt9D57mSxDZgWBKcvx3+r-w@mail.gmail.com)

- logical decoding timeline following
(/messages/by-id/CAMsr+YH-C1-X_+s=2nzAPnR0wwqJa-rUmVHSYyZaNSn93MUBMQ@mail.gmail.com)

The full tree is at https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots if you want to avoid the fiddling around required to apply the patch series.

<0001-Allow-replication-slots-to-follow-failover.patch><0002-Update-decoding_failover-tests-for-failover-slots.patch><0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patch><0004-Add-the-UI-and-for-failover-slots.patch><0005-Document-failover-slots.patch><0006-Add-failover-to-pg_replication_slots.patch><0007-Introduce-TAP-recovery-tests-for-failover-slots.patch>

Thank you for the update. I’ve got some rejects when applying 0001-Allow-replication-slots-to-follow-failover.patch after the “Dirty replication slots when confirm_lsn is changed” changes. I think it should be rebased against master (might be a consequence of the “logical slots follow the timeline” patch being committed).

patch -p1 <~/git/pg/patches/failover-slots/v6/0001-Allow-replication-slots-to-follow-failover.patch
patching file src/backend/access/rmgrdesc/Makefile
Hunk #1 FAILED at 10.
1 out of 1 hunk FAILED -- saving rejects to file src/backend/access/rmgrdesc/Makefile.rej
patching file src/backend/access/rmgrdesc/replslotdesc.c
patching file src/backend/access/transam/rmgr.c
Hunk #1 succeeded at 25 (offset 1 line).
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 6351 (offset 3 lines).
Hunk #2 succeeded at 8199 (offset 14 lines).
Hunk #3 succeeded at 8645 (offset 14 lines).
Hunk #4 succeeded at 8718 (offset 14 lines).
patching file src/backend/commands/dbcommands.c
patching file src/backend/replication/basebackup.c
patching file src/backend/replication/logical/decode.c
Hunk #1 FAILED at 143.
1 out of 1 hunk FAILED -- saving rejects to file src/backend/replication/logical/decode.c.rej
patching file src/backend/replication/logical/logical.c
patching file src/backend/replication/slot.c
patching file src/backend/replication/slotfuncs.c
patching file src/backend/replication/walsender.c
patching file src/bin/pg_xlogdump/replslotdesc.c
patching file src/bin/pg_xlogdump/rmgrdesc.c
Hunk #1 succeeded at 27 (offset 1 line).
patching file src/include/access/rmgrlist.h
Hunk #1 FAILED at 45.
1 out of 1 hunk FAILED -- saving rejects to file src/include/access/rmgrlist.h.rej
patching file src/include/replication/slot.h
patching file src/include/replication/slot_xlog.h
can't find file to patch at input line 1469
Perhaps you used the wrong -p or --strip option?

--
Oleksii

#27Craig Ringer
craig@2ndquadrant.com
In reply to: Oleksii Kliukin (#26)
Re: WIP: Failover Slots

On 5 April 2016 at 04:19, Oleksii Kliukin <alexk@hintbits.com> wrote:

Thank you for the update. I’ve got some rejects when applying
0001-Allow-replication-slots-to-follow-failover.patch after the “Dirty
replication slots when confirm_lsn is changed” changes. I think it should
be rebased against master (might be a consequence of the “logical
slots follow the timeline” patch being committed).

I'll rebase it on top of the new master now that timeline following for
logical slots has been committed, and follow up shortly.

That said, I've marked this patch 'returned with feedback' in the CF. It
should possibly actually be 'rejected' given the discussion on the logical
decoding timeline following thread, which points heavily at a different
approach to solving this problem in 9.7.

That doesn't mean nobody can pick it up if they think it's valuable and
want to run with it, but we're very close to feature freeze now.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#28Simon Riggs
simon@2ndQuadrant.com
In reply to: Craig Ringer (#9)
Re: WIP: Failover Slots

On 25 January 2016 at 14:25, Craig Ringer <craig@2ndquadrant.com> wrote:

I'd like to get failover slots in place for 9.6 since they're fairly
self-contained and meet an immediate need: allowing replication using slots
(physical or logical) to follow a failover event.

I'm a bit confused about this now.

We seem to have timeline following, yet no failover slot. How do we now
follow a failover event?

There are many and varied users of logical decoding now and a fix is
critically important for 9.6.

Do all decoding plugins need to write their own support code??

Please explain how we cope without this, so if a problem remains we can fix
by the freeze.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#29Craig Ringer
craig@2ndquadrant.com
In reply to: Simon Riggs (#28)
Re: WIP: Failover Slots

On 6 April 2016 at 17:43, Simon Riggs <simon@2ndquadrant.com> wrote:

On 25 January 2016 at 14:25, Craig Ringer <craig@2ndquadrant.com> wrote:

I'd like to get failover slots in place for 9.6 since they're fairly
self-contained and meet an immediate need: allowing replication using slots
(physical or logical) to follow a failover event.

I'm a bit confused about this now.

We seem to have timeline following, yet no failover slot. How do we now
follow a failover event?

There are many and varied users of logical decoding now and a fix is
critically important for 9.6.

I agree with you, but I haven't been able to convince enough people of that.

Do all decoding plugins need to write their own support code??

We'll be able to write a bgworker based extension that handles it by
running in the standby. So no, I don't think so.

Please explain how we cope without this, so if a problem remains we can
fix by the freeze.

The TL;DR: Create a slot on the master to hold catalog_xmin where the
replica needs it. Advance it using client or bgworker on replica based on
the catalog_xmin of the oldest slot on the replica. Copy slot state from
the master using an extension that keeps the slots on the replica
reasonably up to date.

All of this is an ugly workaround for not having true slot failover support.
I'm not going to pretend it's nice, or anything that should go anywhere
near core. Petr outlined the approach we want to take for core in 9.7 on
the logical timeline following thread.

Details:

Logical decoding on a slot can follow timeline switches now - or rather,
the xlogreader knows how to follow timeline switches, and the read page
callback used by logical decoding uses that functionality now.

This doesn't help by itself because slots aren't synced to replicas so
they're lost on failover promotion.

Nor can a client just create a backup slot for itself on the replica to
be ready for failover:

- it has no way to create a new slot at a consistent point on the replica
since logical decoding isn't supported on replicas yet;
- it can't advance a logical slot on the replica once created since
decoding isn't permitted on a replica, so it can't just decode from the
replica in lockstep with the master;
- it has no way to stop the master from removing catalog tuples still
needed by the slot's catalog_xmin since catalog_xmin isn't propagated from
standby to master.

So we have to help the client out. To do so, we have a
function/worker/whatever on the replica that grabs the slot state from the
master and copies it to the replica, and we have to hold the master's
catalog_xmin down to the catalog_xmin required by the slots on the replica.

Holding the catalog_xmin down is the easier bit. We create a dummy logical
slot on the master, maintained by a function/bgworker/whatever on the
replica. It gets advanced so that its restart_lsn and catalog_xmin are
those of the oldest slot on the replica. We can do that by requesting
replay on it up to the confirmed_lsn of the lowest confirmed_lsn on the
replica. Ugly, but workable. Or we can abuse the infrastructure more deeply
by simply setting the catalog_xmin and restart_lsn on the slot directly,
but I'd rather not.

Just copying slot state is pretty simple too, as at the C level you can
create a physical or logical slot with whatever state you want.

However, that lets you copy/create any number of bogus ones, many of which
will appear to work fine but will be subtly broken. Since the replica is an
identical copy of the master we know that a slot state that was valid on
the master at a given xlog insert lsn is also valid on the replica at the
same replay lsn, but we've got no reliable way to ensure that when the
master updates a slot at LSN A/B the replica also updates the slot at
replay of LSN A/B. That's what failover slots did. Without that we need to
use some external channel - but there's no way to capture knowledge of "at
exactly LSN A/B, master saved a new copy of slot X" since we can't hook
ReplicationSlotSave(). At least we *can* now inject slot state updates as
generic WAL messages though, so we can ensure they happen at exactly the
desired point in replay.

As Andres explained on the timeline following thread it's not safe for the
slot on the replica to be behind the state the slot on the master was at
the same LSN. At least unless we can protect catalog_xmin via some other
mechanism so we can make sure no catalogs still needed by the slots on the
replica are vacuumed away. It's vital that the catalog_xmin of any slots on
the replica be >= the catalog_xmin the master had for the lowest
catalog_xmin of any of its slots at the same LSN.

So what I figure we'll do is poll slot shmem on the master. When we notice
that a slot has changed we'll dump it into xlog via the generic xlog
mechanism to be applied on the replica, much like failover slots. The slot
update might arrive a bit late on the replica, but that's OK because we're
holding catalog_xmin pinned on the master using the dummy slot.

I don't like it, but I don't have anything better for 9.6.

I'd really like to be able to build a more solid proof of concept that
tests this with a lagging replica, but -ENOTIME before FF.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#30Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#29)
Re: WIP: Failover Slots

A few thoughts on failover slots vs the alternative of pushing catalog_xmin
up to the master via a replica's slot and creating independent slots on
replicas.

Failover slots:
---

+ Failover slots are very easy for applications. They "just work" and are
transparent for failover. This is great especially for things that aren't
complex replication schemes, that just want to use logical decoding.

+ Applications don't have to know what replicas exist or be able to reach
them; transparent failover is easier.

- Failover slots can't be used from a cascading standby (where we can fail
down to the standby's own replicas) because they have to write WAL to
advance the slot position. They'd have to send the slot position update
"up" to the master then wait to replay it. Not a disaster, though they'd do
extra work on reconnect until a restart_lsn update replayed. Would require
a whole new feedback-like message on the rep protocol, and couldn't work at
all with archive replication. Ugly as hell.

+ Failover slots exist now, and could be added to 9.6.

- The UI for failover slots can't be re-used for the catalog_xmin push-up
approach to allow replay from failover slots on cascading standbys in 9.7+.
There'd be no way to propagate the creation of failover slots "down" the
replication hierarchy that way, especially to archive standbys, the way
failover slots do. So it'd be semantically different and couldn't
re-use the FS UI. We'd be stuck with failover slots even if we also did the
other way later.

+ Will work for recovery of a master PITR-restored up to the latest
recovery point

Independent slots on replicas + catalog_xmin push-up
---

With this approach we allow creation of replication slots on a replica
independently of the master. The replica is required to connect to the
master via a slot. We send feedback to the master to advance the replica's
slot on the master to the confirmed_lsn of the most-behind slot on the
replica, thereby pinning the master's catalog_xmin where needed. Or we just
send a new feedback message type that directly sets a catalog_xmin on the
replica's physical slot in the master. Slots are _not_ cloned from master
to replica automatically.

- More complicated for applications to use. They have to create a slot on
each replica that might be failed over to as well as the master and have to
advance all those slots to stop the master from suffering severe catalog
bloat. (But see note below).

- Applications must be able to connect to failover-candidate standbys and
know where they are, it's not automagically handled via WAL. (But see note
below).

- Applications need reconfiguration whenever a standby is rebuilt, moved,
etc. (But see note below).

- Cannot work at all for archive-based replication, requires a slot from
replica to master.

+ Works with replay from cascading standbys

+ Actually solves one of the problems making logical slots on standbys
unsupported at the moment by giving us a way to pin the master's
catalog_xmin to that needed by a replica.

- Won't work for a standby PITR-restored up to latest.

- Vapourware with zero hope for 9.6

Note: I think the application complexity issues can be solved - to a degree
- by having the replicas run a bgworker based helper that connects to the
master and clones the master's slots then advances them automatically.

Do nothing
---

Drop the idea of being able to follow physical failover on logical slots.

I've already expressed why I think this is a terrible idea. It's hostile to
application developers who'd like to use logical decoding. It makes
integration of logical replication with existing HA systems much harder. It
means we need really solid, performant, well-tested and mature logical rep
based HA before we can take logical rep seriously, which is a long way out
given that we can't do decoding of in-progress xacts, ddl, sequences, ....
etc etc.

Some kind of physical HA for logical slots is needed and will be needed for
some time. Logical rep will be great for selective replication, replication
over WAN, filtered/transformed replication etc. Physical rep is great for
knowing you'll get exactly the same thing on the replica that you have on
the master and it'll Just Work.

In any case, "Do nothing" is the same for 9.6 as pursuing the catalog_xmin
push-up idea; in both cases we don't commit anything in 9.6.

#31Simon Riggs
simon@2ndQuadrant.com
In reply to: Craig Ringer (#30)
Re: WIP: Failover Slots

On 6 April 2016 at 14:15, Craig Ringer <craig@2ndquadrant.com> wrote:
...

Nice summary

Failover slots are optional. And they work on master.

While the other approach could also work, it will work later and still
require a slot on the master.

=> I don't see why having Failover Slots in 9.6 would prevent us from
having something else later, if someone else writes it.

We don't need to add this to core. Each plugin can independently write its
own failover code. Works, but doesn't seem like the right approach for open
source.

=> I think we should add Failover Slots to 9.6.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#32Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#31)
Re: WIP: Failover Slots

On 2016-04-06 14:30:21 +0100, Simon Riggs wrote:

On 6 April 2016 at 14:15, Craig Ringer <craig@2ndquadrant.com> wrote:
...

Nice summary

Failover slots are optional. And they work on master.

While the other approach could also work, it will work later and still
require a slot on the master.

=> I don't see why having Failover Slots in 9.6 would prevent us from
having something else later, if someone else writes it.

We don't need to add this to core. Each plugin can independently write its
own failover code. Works, but doesn't seem like the right approach for open
source.

=> I think we should add Failover Slots to 9.6.

Simon, please don't take this personally, given the other ongoing
thread.

I don't think this is commit-ready. For one, I think this is
architecturally the wrong choice. But even leaving that fact aside, and
considering this a temporary solution (which we can't easily remove), there
appears to have been very little code-level review (one early from Petr
in [1], two by Oleksii mostly focusing on error messages [2] [3]). The
whole patch was submitted late to the 9.6 cycle.

Quickly skimming 0001 in [4] there appear to be a number of issues:
* LWLockHeldByMe() is only for debugging, not functional differences
* ReplicationSlotPersistentData is now in an xlog related header
* The code and behaviour around name conflicts of slots seems pretty
raw, and not discussed
* Taking spinlocks dependent on InRecovery() seems like a seriously bad
idea
* I doubt that the archive based switches around StartupReplicationSlots
do what they intend. Afaics that'll not work correctly for basebackups
taken with -X, without recovery.conf

That's from a ~5 minute skim, of one patch in the series.

[1]: http://archives.postgresql.org/message-id/CALLjQTSCHvcsF6y7%3DZhmdMjJUMGLqt1-6Pz2rtb7PfFLxFfBOw%40mail.gmail.com
[2]: http://archives.postgresql.org/message-id/FA68178E-F0D1-47F6-9791-8A3E2136C119%40hintbits.com
[3]: http://archives.postgresql.org/message-id/9B503EB5-676A-4258-9F78-27FC583713FE%40hintbits.com
[4]: http://archives.postgresql.org/message-id/CAMsr+YE6LNy2e0tBuAQB+NTVb6W-dHJAfLq0-zbAL7G7hjhXBA@mail.gmail.com

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#33Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#32)
Re: WIP: Failover Slots

On 6 April 2016 at 15:17, Andres Freund <andres@anarazel.de> wrote:

On 2016-04-06 14:30:21 +0100, Simon Riggs wrote:

On 6 April 2016 at 14:15, Craig Ringer <craig@2ndquadrant.com> wrote:
...

Nice summary

Failover slots are optional. And they work on master.

While the other approach could also work, it will work later and still
require a slot on the master.

=> I don't see why having Failover Slots in 9.6 would prevent us from
having something else later, if someone else writes it.

We don't need to add this to core. Each plugin can independently write is
own failover code. Works, but doesn't seem like the right approach for

open

source.

=> I think we should add Failover Slots to 9.6.

Simon, please don't take this personally, given the other ongoing
thread.

Thanks for the review. Rational technical comments are exactly why we are
here and they are always welcome.

For one I think this is architecturally the wrong choice.

But even leaving that fact aside, and considering this a temporary
solution (we can't easily remove),

As I observed above, the alternate solution doesn't sound particularly good
either, but the main point is that we wouldn't need to remove this one; the
two can coexist happily. I would add that I did consider the alternate
solution previously as well; this one seemed simpler, which is always key
for me in code aimed at robustness.

there appears to have been very little code level review

That is potentially fixable. At this point I don't claim it is committable;
I only say it is important and the alternate solution is not significantly
better, so if the patch can be beaten into shape we should commit it.

I will spend some time on this and see if we have something viable. Which
will be posted here for discussion, as would have happened even before our
other discussions.

Thanks for the points below

(one early from Petr in [1], two by Oleksii mostly focusing on error
messages [2] [3]). The whole patch was submitted late to the 9.6 cycle.

Quickly skimming 0001 in [4] there appear to be a number of issues:
* LWLockHeldByMe() is only for debugging, not functional differences
* ReplicationSlotPersistentData is now in an xlog related header
* The code and behaviour around name conflicts of slots seems pretty
raw, and not discussed
* Taking spinlocks dependent on InRecovery() seems like a seriously bad
idea
* I doubt that the archive based switches around StartupReplicationSlots
do what they intend. Afaics that'll not work correctly for basebackups
taken with -X, without recovery.conf

That's from a ~5 minute skim, of one patch in the series.

[1]
http://archives.postgresql.org/message-id/CALLjQTSCHvcsF6y7%3DZhmdMjJUMGLqt1-6Pz2rtb7PfFLxFfBOw%40mail.gmail.com
[2]
http://archives.postgresql.org/message-id/FA68178E-F0D1-47F6-9791-8A3E2136C119%40hintbits.com
[3]
http://archives.postgresql.org/message-id/9B503EB5-676A-4258-9F78-27FC583713FE%40hintbits.com
[4]
http://archives.postgresql.org/message-id/CAMsr+YE6LNy2e0tBuAQB+NTVb6W-dHJAfLq0-zbAL7G7hjhXBA@mail.gmail.com

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#34Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#32)
Re: WIP: Failover Slots

On 6 April 2016 at 22:17, Andres Freund <andres@anarazel.de> wrote:

Quickly skimming 0001 in [4] there appear to be a number of issues:
* LWLockHeldByMe() is only for debugging, not functional differences
* ReplicationSlotPersistentData is now in an xlog related header
* The code and behaviour around name conflicts of slots seems pretty
raw, and not discussed
* Taking spinlocks dependent on InRecovery() seems like a seriously bad
idea
* I doubt that the archive based switches around StartupReplicationSlots
do what they intend. Afaics that'll not work correctly for basebackups
taken with -X, without recovery.conf

Thanks for looking at it. Most of those are my errors. I think this is
pretty dead at least for 9.6, so I'm mostly following up in the hopes of
learning about a couple of those mistakes.

Good catch with -X without a recovery.conf. Since it wouldn't be recognised
as a promotion and wouldn't increment the timeline, copied non-failover
slots wouldn't get removed. I've never liked that logic at all anyway, I
just couldn't think of anything better...

LWLockHeldByMe() has a comment to the effect of: "This is meant as debug
support only." So that's just a dumb mistake on my part, and I should've
added "alreadyLocked" parameters. (Ugly, but works).

But why would it be a bad idea to conditionally take a code path that
acquires a spinlock based on whether RecoveryInProgress()? It's not testing
RecoveryInProgress() more than once and doing the acquire and release based
on separate tests, which would be a problem. I don't really get the problem
with:

if (!RecoveryInProgress())
{
/* first check whether there's something to write out */
SpinLockAcquire(&slot->mutex);
was_dirty = slot->dirty;
slot->just_dirtied = false;
SpinLockRelease(&slot->mutex);

/* and don't do anything if there's nothing to write */
if (!was_dirty)
return;
}

... though I think what I really should've done there is just always
dirty the slot in the redo functions.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#35Thom Brown
thom@linux.com
In reply to: Craig Ringer (#34)
Re: WIP: Failover Slots

On 8 April 2016 at 07:13, Craig Ringer <craig@2ndquadrant.com> wrote:

On 6 April 2016 at 22:17, Andres Freund <andres@anarazel.de> wrote:

Quickly skimming 0001 in [4] there appear to be a number of issues:
* LWLockHeldByMe() is only for debugging, not functional differences
* ReplicationSlotPersistentData is now in an xlog related header
* The code and behaviour around name conflicts of slots seems pretty
raw, and not discussed
* Taking spinlocks dependent on InRecovery() seems like a seriously bad
idea
* I doubt that the archive based switches around StartupReplicationSlots
do what they intend. Afaics that'll not work correctly for basebackups
taken with -X, without recovery.conf

Thanks for looking at it. Most of those are my errors. I think this is
pretty dead at least for 9.6, so I'm mostly following up in the hopes of
learning about a couple of those mistakes.

Good catch with -X without a recovery.conf. Since it wouldn't be recognised
as a promotion and wouldn't increment the timeline, copied non-failover
slots wouldn't get removed. I've never liked that logic at all anyway, I
just couldn't think of anything better...

LWLockHeldByMe() has a comment to the effect of: "This is meant as debug
support only." So that's just a dumb mistake on my part, and I should've
added "alreadyLocked" parameters. (Ugly, but works).

But why would it be a bad idea to conditionally take a code path that
acquires a spinlock based on whether RecoveryInProgress()? It's not testing
RecoveryInProgress() more than once and doing the acquire and release based
on separate tests, which would be a problem. I don't really get the problem
with:

if (!RecoveryInProgress())
{
/* first check whether there's something to write out */
SpinLockAcquire(&slot->mutex);
was_dirty = slot->dirty;
slot->just_dirtied = false;
SpinLockRelease(&slot->mutex);

/* and don't do anything if there's nothing to write */
if (!was_dirty)
return;
}

... though I think what I really should've done there is just always dirty
the slot in the redo functions.

Are there any plans to submit a new design/version for v11?

Thanks

Thom


#36Craig Ringer
craig@2ndquadrant.com
In reply to: Thom Brown (#35)
Re: WIP: Failover Slots

On 26 July 2017 at 00:16, Thom Brown <thom@linux.com> wrote:

On 8 April 2016 at 07:13, Craig Ringer <craig@2ndquadrant.com> wrote:

On 6 April 2016 at 22:17, Andres Freund <andres@anarazel.de> wrote:

Quickly skimming 0001 in [4] there appear to be a number of issues:
* LWLockHeldByMe() is only for debugging, not functional differences
* ReplicationSlotPersistentData is now in an xlog related header
* The code and behaviour around name conflicts of slots seems pretty
raw, and not discussed
* Taking spinlocks dependent on InRecovery() seems like a seriously bad
idea
* I doubt that the archive based switches around StartupReplicationSlots
do what they intend. Afaics that'll not work correctly for basebackups
taken with -X, without recovery.conf

Thanks for looking at it. Most of those are my errors. I think this is
pretty dead at least for 9.6, so I'm mostly following up in the hopes of
learning about a couple of those mistakes.

Good catch with -X without a recovery.conf. Since it wouldn't be recognised
as a promotion and wouldn't increment the timeline, copied non-failover
slots wouldn't get removed. I've never liked that logic at all anyway, I
just couldn't think of anything better...

LWLockHeldByMe() has a comment to the effect of: "This is meant as debug
support only." So that's just a dumb mistake on my part, and I should've
added "alreadyLocked" parameters. (Ugly, but works).

But why would it be a bad idea to conditionally take a code path that
acquires a spinlock based on whether RecoveryInProgress()? It's not testing
RecoveryInProgress() more than once and doing the acquire and release based
on separate tests, which would be a problem. I don't really get the problem
with:

if (!RecoveryInProgress())
{
/* first check whether there's something to write out */
SpinLockAcquire(&slot->mutex);
was_dirty = slot->dirty;
slot->just_dirtied = false;
SpinLockRelease(&slot->mutex);

/* and don't do anything if there's nothing to write */
if (!was_dirty)
return;
}

... though I think what I really should've done there is just always dirty
the slot in the redo functions.

Are there any plans to submit a new design/version for v11?

No. The whole approach seems to have been bounced from core. I don't agree
and continue to think this functionality is desirable but I don't get to
make that call.

If time permits I will attempt to update the logical decoding on standby
patchset instead, and if possible add support for fast-forward logical
decoding that does the minimum required to correctly maintain a slot's
catalog_xmin and restart_lsn when advanced. But this won't be usable
directly for failover like failover slots are, it'll require each
application to keep track of standbys and maintain slots on them too. Or
we'll need some kind of extension/helper to sync slot state.

In the mean time, 2ndQuadrant maintains an on-disk-compatible version of
failover slots that's available for support customers.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#37Robert Haas
robertmhaas@gmail.com
In reply to: Craig Ringer (#36)
Re: WIP: Failover Slots

On Tue, Jul 25, 2017 at 8:44 PM, Craig Ringer <craig@2ndquadrant.com> wrote:

No. The whole approach seems to have been bounced from core. I don't agree
and continue to think this functionality is desirable but I don't get to
make that call.

I actually think failover slots are quite desirable, especially now
that we've got logical replication in core. In a review of this
thread I don't see anyone saying otherwise. The debate has really
been about the right way of implementing that. Suppose we did
something like this:

- When a standby connects to a master, it can optionally supply a list
of slot names that it cares about.
- The master responds by periodically notifying the standby of changes
to the slot contents using some new replication sub-protocol message.
- The standby applies those updates to its local copies of the slots.

So, you could create a slot on a standby with an "uplink this" flag of
some kind, and it would then try to keep it up to date using the
method described above. It's not quite clear to me how to handle the
case where the corresponding slot doesn't exist on the master, or
initially does but then it's later dropped, or it initially doesn't
but it's later created.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#38Craig Ringer
craig@2ndquadrant.com
In reply to: Robert Haas (#37)
Re: WIP: Failover Slots

On 3 August 2017 at 04:35, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jul 25, 2017 at 8:44 PM, Craig Ringer <craig@2ndquadrant.com>
wrote:

No. The whole approach seems to have been bounced from core. I don't agree
and continue to think this functionality is desirable but I don't get to
make that call.

I actually think failover slots are quite desirable, especially now
that we've got logical replication in core. In a review of this
thread I don't see anyone saying otherwise. The debate has really
been about the right way of implementing that. Suppose we did
something like this:

- When a standby connects to a master, it can optionally supply a list
of slot names that it cares about.

Wouldn't that immediately exclude use for PITR and snapshot recovery? I
have people right now who want the ability to promote a PITR-recovered
snapshot into place of a logical replication master and have downstream
peers replay from it. It's more complex than that, as there's a resync
process required to recover changes the failed node had sent to other peers
but isn't available in the WAL archive, but that's the gist.

If you have a 5TB database do you want to run an extra replica or two
because PostgreSQL can't preserve slots without a running, live replica?
Your SAN snapshots + WAL archiving have been fine for everything else so
far.

Requiring live replication connections could also be an issue for service
interruptions, surely? Unless you persist needed knowledge in the physical
replication slot used by the standby to master connection, so the master
can tell the difference between "downstream went away for a while but will
come back" and "downstream is gone forever, toss out its resources."

That's exactly what the catalog_xmin hot_standby_feedback patches in Pg10
do, but they can only tell the master about the oldest resources needed by
any existing slot on the replica. Not which slots. And they have the same
issues with needing a live, running replica.

Also, what about cascading? Lots of "pull" model designs I've looked at
tend to fall down in cascaded environments. For that matter so do failover
slots, but only for the narrower restriction of not being able to actually
decode from a failover-enabled slot on a standby, they still work fine in
terms of cascading down to leaf nodes.

- The master responds by periodically notifying the standby of changes

to the slot contents using some new replication sub-protocol message.
- The standby applies those updates to its local copies of the slots.

That's pretty much what I expect to have to do for clients to work on
unpatched Pg10, probably using a separate bgworker and normal libpq
connections to the upstream since we don't have hooks to extend the
walsender/walreceiver.

It can work now that the catalog_xmin hot_standby_feedback patches are in,
but it'd require some low-level slot state setting that I know Andres is
not a fan of. So I expect to carry on relying on an out-of-tree failover
slots patch for Pg 10.

So, you could create a slot on a standby with an "uplink this" flag of
some kind, and it would then try to keep it up to date using the
method described above. It's not quite clear to me how to handle the
case where the corresponding slot doesn't exist on the master, or
initially does but then it's later dropped, or it initially doesn't
but it's later created.

Thoughts?

Right. So the standby must be running and in active communication. It needs
some way to know the master has confirmed slot creation and it can rely on
the slot's resources really being reserved by the master. That turns out to
be quite hard, per the decoding on standby patches. There needs to be some
way to tell the master a standby has gone away forever and to drop its
dependent slots, so you're not stuck wondering "is slot xxyz from standby
abc that we lost in that crash?". Standbys need to cope with having created
a slot, only to find out there's a name collision with master.

For all those reasons, I just extended hot_standby_feedback to report
catalog_xmin separately to upstreams instead, so the existing physical slot
serves all these needs. And it's part of the picture, but there's no way to
get slot position change info from the master back down onto the replicas
so the replicas can advance any of their own slots and, via feedback, free
up master resources. That's where the bgworker hack to query
pg_replication_slots comes in. Seems complex, full of restrictions, and
fragile to me compared to just expecting the master to do it.

The only objection I personally understood and accepted re failover slots
was that it'd be impossible to create a failover slot on a standby and have
that standby "sub-tree" support failover to leaf nodes. Which is true, but
instead we have nothing and no viable-looking roadmap toward anything users
can benefit from. So I don't think that's the worst restriction in the
world.

I do not understand why logical replication slots are exempt from our usual
policy that anything that works on the master should be expected to work on
failover to a standby. Is there anything persistent across crash for which
that's not the case, except grandfathered-in hash indexes? We're hardly
going to say "hey, it's ok to forget about prepared xacts when you fail
over to a standby" yet this problem with failover and slots in logical
decoding and replication is the same sort of showstopper issue for users
who use the functionality.

In the medium term I've given up making progress with getting something
simple and usable into user hands on this. A tweaked version of failover
slots is being carried as an out-of-tree on-disk-format-compatible patch
instead, and it's meeting customer needs very well. I've done my dash here
and moved on to other things where I can make more progress.

I'd like to continue working on logical decoding on standby support for
pg11 too, but even if we can get that in place it'll only work for
reachable, online standbys. Every application that uses logical decoding
will have to maintain a directory of standbys (which it has no way to ask
the master for) and advance their slots via extra walsender connections.
They'll do a bunch of unnecessary work decoding WAL they don't need, just
to throw the data away. It won't help for PITR and snapshot use cases at
all. So for now I'm not able to allocate much priority to that.

I'd love to get failover slots in, I still think it's the simplest and best
way to do what users need. It doesn't stop us progressing with decoding on
standby or paint us into any corners.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#39Robert Haas
robertmhaas@gmail.com
In reply to: Craig Ringer (#38)
Re: WIP: Failover Slots

On Tue, Aug 8, 2017 at 4:00 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

- When a standby connects to a master, it can optionally supply a list
of slot names that it cares about.

Wouldn't that immediately exclude use for PITR and snapshot recovery? I have
people right now who want the ability to promote a PITR-recovered snapshot
into place of a logical replication master and have downstream peers replay
from it. It's more complex than that, as there's a resync process required
to recover changes the failed node had sent to other peers but isn't
available in the WAL archive, but that's the gist.

If you have a 5TB database do you want to run an extra replica or two
because PostgreSQL can't preserve slots without a running, live replica?
Your SAN snapshots + WAL archiving have been fine for everything else so
far.

OK, so what you're basically saying here is that you want to encode
the failover information in the write-ahead log rather than passing it
at the protocol level, so that if you replay the write-ahead log on a
time delay you get the same final state that you would have gotten if
you had replayed it immediately. I hadn't thought about that
potential advantage, and I can see that it might be an advantage for
some reason, but I don't yet understand what the reason is. How would
you imagine using any version of this feature in a PITR scenario? If
you PITR the master back to an earlier point in time, I don't see how
you're going to manage without resyncing the replicas, at which point
you may as well just drop the old slot and create a new one anyway.
Maybe you're thinking of a scenario where we PITR the master and also
use PITR to rewind the replica to a slightly earlier point? But I
can't quite follow what you're thinking about. Can you explain
further?

Requiring live replication connections could also be an issue for service
interruptions, surely? Unless you persist needed knowledge in the physical
replication slot used by the standby to master connection, so the master can
tell the difference between "downstream went away for while but will come
back" and "downstream is gone forever, toss out its resources."

I don't think the master needs to retain any resources on behalf of
the failover slot. If the slot has been updated by feedback from the
associated standby, then the master can toss those resources
immediately. When the standby comes back on line, it will find out
via a protocol message that it can fast-forward the slot to whatever
the new LSN is, and any WAL files before that point are irrelevant on
both the master and the standby.

Also, what about cascading? Lots of "pull" model designs I've looked at tend
to fall down in cascaded environments. For that matter so do failover slots,
but only for the narrower restriction of not being able to actually decode
from a failover-enabled slot on a standby, they still work fine in terms of
cascading down to leaf nodes.

I don't see the problem. The cascaded standby tells the standby "I'm
interested in the slot called 'craig'" and the standby says "sure,
I'll tell you whenever 'craig' gets updated" but it turns out that
'craig' is actually a failover slot on that standby, so that standby
has said to the master "I'm interested in the slot called 'craig'" and
the master is therefore sending updates to that standby. Every time
the slot is updated, the master tells the standby and the standby
tells the cascaded standby and, well, that all seems fine.

Also, as Andres pointed out upthread, if the state is passed through
the protocol, you can have a slot on a standby that cascades to a
cascaded standby; if the state is passed through the WAL, all slots
have to cascade from the master. Generally, with protocol-mediated
failover slots, you can have a different set of slots on every replica
in the cluster and create, drop, and reconfigure them any time you
like. With WAL-mediated slots, all failover slots must come from the
master and cascade to every standby you've got, which is less
flexible.

I don't want to come on too strong here. I'm very willing to admit
that you may know a lot more about this than me and I am really
extremely happy to benefit from that accumulated knowledge. If you're
saying that WAL-mediated slots are a lot better than protocol-mediated
slots, you may well be right, but I don't yet understand the reasons,
and I want to understand the reasons. I think this stuff is too
important to just have one person saying "here's a patch that does it
this way" and everybody else just says "uh, ok". Once we adopt some
proposal here we're going to have to continue supporting it forever,
so it seems like we'd better do our best to get it right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#40Craig Ringer
craig@2ndquadrant.com
In reply to: Robert Haas (#39)
Re: WIP: Failover Slots

On 9 August 2017 at 23:42, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Aug 8, 2017 at 4:00 AM, Craig Ringer <craig@2ndquadrant.com>
wrote:

- When a standby connects to a master, it can optionally supply a list
of slot names that it cares about.

Wouldn't that immediately exclude use for PITR and snapshot recovery? I
have people right now who want the ability to promote a PITR-recovered
snapshot into place of a logical replication master and have downstream
peers replay from it. It's more complex than that, as there's a resync
process required to recover changes the failed node had sent to other peers
but isn't available in the WAL archive, but that's the gist.

If you have a 5TB database do you want to run an extra replica or two
because PostgreSQL can't preserve slots without a running, live replica?
Your SAN snapshots + WAL archiving have been fine for everything else so
far.

OK, so what you're basically saying here is that you want to encode
the failover information in the write-ahead log rather than passing it
at the protocol level, so that if you replay the write-ahead log on a
time delay you get the same final state that you would have gotten if
you had replayed it immediately. I hadn't thought about that
potential advantage, and I can see that it might be an advantage for
some reason, but I don't yet understand what the reason is. How would
you imagine using any version of this feature in a PITR scenario? If
you PITR the master back to an earlier point in time, I don't see how
you're going to manage without resyncing the replicas, at which point
you may as well just drop the old slot and create a new one anyway.

I've realised that it's possible to work around it in app-space anyway. You
create a new slot on a node before you snapshot it, and you don't drop this
slot until you discard the snapshot. The existence of this slot ensures
that any WAL generated by the node (and replayed by PITR after restore)
cannot clobber needed catalog_xmin. If we xlog catalog_xmin advances or
have some other safeguard in place, which we need for logical decoding on
standby to be safe anyway, then we can fail gracefully if the user does
something dumb.

So no need to care about this.

(What I wrote previously on this was):

You definitely can't just PITR restore and pick up where you left off.

You need a higher level protocol between replicas to recover. For example,
in a multi-master configuration, this can be something like (simplified):

* Use the timeline history file to find the lsn at which we diverged from
our "future self", the failed node
* Connect to the peer and do logical decoding, with a replication origin
filter for "originating from me", for xacts from the divergence lsn up to
the peer's current end-of-wal.
* Reset peer's replication origin for us to our new end-of-wal, and resume
replication

To enable that to be possible, since we can't rewind slots once confirmed
advanced, maintain a backup slot on the peer corresponding to the
point-in-time at which a snapshot was taken.

For most other situations there is little benefit vs just re-creating the
slot before you permit user-initiated write xacts to begin on the restored
node.

I can accept an argument that "we" as pgsql-hackers do not consider this
something worth caring about, should that be the case. It's niche enough
that you could argue it doesn't have to be supportable in stock postgres.

Maybe you're thinking of a scenario where we PITR the master and also

use PITR to rewind the replica to a slightly earlier point?

That can work, but must be done in lock-step. You have to pause apply on
both ends for long enough to snapshot both, otherwise the replication
origins on one end get out of sync with the slots on another.

Interesting, but I really hope nobody's going to need to do it.

But I
can't quite follow what you're thinking about. Can you explain
further?

Gladly.

I've been up to my eyeballs in this for years now, and sometimes it becomes
quite hard to see the outside perspective, so thanks for your patience.

Requiring live replication connections could also be an issue for service
interruptions, surely? Unless you persist needed knowledge in the physical
replication slot used by the standby to master connection, so the master
can tell the difference between "downstream went away for while but will
come back" and "downstream is gone forever, toss out its resources."

I don't think the master needs to retain any resources on behalf of
the failover slot. If the slot has been updated by feedback from the
associated standby, then the master can toss those resources
immediately. When the standby comes back on line, it will find out
via a protocol message that it can fast-forward the slot to whatever
the new LSN is, and any WAL files before that point are irrelevant on
both the master and the standby.

OK, so you're envisioning that every slot on a downstream has a mirror slot
on the upstream, and that is how the master retains the needed resources.

Also, what about cascading? Lots of "pull" model designs I've looked at
tend to fall down in cascaded environments. For that matter so do failover
slots, but only for the narrower restriction of not being able to actually
decode from a failover-enabled slot on a standby, they still work fine in
terms of cascading down to leaf nodes.

I don't see the problem. The cascaded standby tells the standby "I'm
interested in the slot called 'craig'" and the standby says "sure,
I'll tell you whenever 'craig' gets updated" but it turns out that
'craig' is actually a failover slot on that standby, so that standby
has said to the master "I'm interested in the slot called 'craig'" and
the master is therefore sending updates to that standby. Every time
the slot is updated, the master tells the standby and the standby
tells the cascaded standby and, well, that all seems fine.

Yep, so again, you're pushing slots "up" the tree, by name, with a 1:1
correspondence, and using globally unique slot names to manage state.

If slot names collide, you presumably fail with "er, don't do that then".
Or scramble data horribly. Both of which we certainly have precedent for
in Pg (see, e.g, what happens if two snapshots of the same node are in
archive recovery and promote to the same timeline, then start archiving to
the same destination...). So not a showstopper.

I'm pretty OK with that.

Also, as Andres pointed out upthread, if the state is passed through
the protocol, you can have a slot on a standby that cascades to a
cascaded standby; if the state is passed through the WAL, all slots
have to cascade from the master.

Yes, that's my main hesitation with the current failover slots, as
mentioned in the prior message.

Generally, with protocol-mediated
failover slots, you can have a different set of slots on every replica
in the cluster and create, drop, and reconfigure them any time you
like. With WAL-mediated slots, all failover slots must come from the
master and cascade to every standby you've got, which is less
flexible.

Definitely agreed.

Different standbys don't know about each other so it's the user's job to
make sure they ensure uniqueness, using slot name as a key.

I don't want to come on too strong here. I'm very willing to admit
that you may know a lot more about this than me and I am really
extremely happy to benefit from that accumulated knowledge.

The flip side is that I've also been staring at the problem, on and off,
for WAY too long. So other perspectives can be really valuable.

If you're
saying that WAL-mediated slots are a lot better than protocol-mediated
slots, you may well be right, but I don't yet understand the reasons,
and I want to understand the reasons. I think this stuff is too
important to just have one person saying "here's a patch that does it
this way" and everybody else just says "uh, ok". Once we adopt some
proposal here we're going to have to continue supporting it forever,
so it seems like we'd better do our best to get it right.

I mostly agree there. We could have relatively easily converted WAL-based
failover slots to something else in a major version bump, and that's why I
wanted to get them in place for 9.6 and then later for pg10. Because people
were (and are) constantly asking me and others who work on logical
replication tools why it doesn't work, and a 90% solution that doesn't
paint us into a corner seemed just fine.

I'm quite happy to find a better one. But I cannot spend a lot of time
writing something to have it completely knocked back because the scope just
got increased again and now it has to do more, so it needs another rewrite.

So, how should this look if we're using the streaming rep protocol?

How about:

A "failover slot" is identified by a field in the slot struct and exposed
in pg_replication_slots. It can be null (not a failover slot). It can
indicate that the slot was created locally and is "owned" by this node; all
downstreams should mirror it. It can also indicate that it is a mirror of
an upstream, in which case clients may not replay from it until it's
promoted to an owned slot and ceases to be mirrored. Attempts to replay
from a mirrored slot just ERROR and will do so even once decoding on
standby is supported.

This promotion happens automatically if a standby is promoted to a master,
and can also be done manually via sql function call or walsender command to
allow for an internal promotion within a cascading replica chain.

When a replica connects to an upstream it asks via a new walsender msg
"send me the state of all your failover slots". Any local mirror slots are
updated. If they are not listed by the upstream they are known deleted, and
the mirror slots are deleted on the downstream.

The upstream walsender then sends periodic slot state updates while
connected, so replicas can advance their mirror slots, and in turn send
hot_standby_feedback that gets applied to the physical replication slot
used by the standby, freeing resources held for the slots on the master.

There's one big hole left here. When we create a slot on a cascading leaf
or inner node, it takes time for hot_standby_feedback to propagate the
needed catalog_xmin "up" the chain. Until the master has set the needed
catalog_xmin on the physical slot for the closest branch, the inner node's
slot's catalog_xmin can only be tentative pending confirmation. That's what
a whole bunch of gruesomeness in the decoding on standby patch was about.

One possible solution to this is to also mirror slots "up", as you alluded
to: when you create an "owned" slot on a replica, it tells the master at
connect time / slot creation time "I have this slot X, please copy it up
the tree". The slot gets copied "up" to the master via cascading layers
with a different failover slot type indicating it's an up-mirror. Decoding
clients aren't allowed to replay from an up-mirror slot and it cannot be
promoted like a down-mirror slot can, it's only there for resource
retention. A node knows its owned slot is safe to actually use, and is
fully created, when it sees the walsender report it in the list of failover
slots from the master during a slot state update.

This imposes some restrictions:

* failover slot names must be globally unique or things go "kaboom"
* if a replica goes away, its up-mirror slots stay dangling until the admin
manually cleans them up

Tolerable, IMO. But we could fix the latter by requiring that failover
slots only be enabled when the replica uses a physical slot to talk to the
upstream. The up-mirror failover slots then get coupled to the physical
slot by an extra field in the slot struct holding the name of the owning
physical slot. Dropping that physical slot cascade-drops all up-mirror
slots automatically. Admins are prevented from dropping up-mirror slots
manually, which protects against screwups.

We could even fix the naming, maybe, with some kind of qualified naming
based on the physical slot, but it's not worth the complexity.

It sounds a bit more complex than your sketch, but I think the four
failover kinds are necessary to support this. We'll have:

* not a failover slot, purely local

* a failover slot owned by this node (will be usable for decoding on
standby once supported)

* an up-mirror slot, not promotable, resource retention only, linked to a
physical slot for a given replica

* a down-mirror slot, promotable, not linked to a physical slot; this is
the true "failover slot"'s representation on a replica.

Thoughts? Feels pretty viable to me.

Thanks for the new perspective.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#41Robert Haas
robertmhaas@gmail.com
In reply to: Craig Ringer (#40)
Re: WIP: Failover Slots

On Thu, Aug 10, 2017 at 2:38 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

Yep, so again, you're pushing slots "up" the tree, by name, with a 1:1
correspondence, and using globally unique slot names to manage state.

Yes, that's what I'm imagining. (Whether I should instead be
imagining something else is the important question.)

I'm quite happy to find a better one. But I cannot spend a lot of time
writing something to have it completely knocked back because the scope just
got increased again and now it has to do more, so it needs another rewrite.

Well, I can't guarantee anything about that. I don't tend to argue
against designs to which I myself previously agreed, but other people
may, and there's not a lot I can do about that (although sometimes I
try to persuade them that they're wrong, if I think they are). Of
course, sometimes you implement something and it doesn't look as good
as you thought it would; that's a risk of software development
generally. I'd like to push back a bit on the underlying assumption,
though: I don't think that there was ever an agreed-upon design on
this list for failover slots before the first patch showed up. Well,
anybody's welcome to write code without discussion and drop it to the
list, but if people don't like it, that's the risk you took by not
discussing it first.

A "failover slot" is identified by a field in the slot struct and exposed
in pg_replication_slots. It can be null (not a failover slot). It can
indicate that the slot was created locally and is "owned" by this node; all
downstreams should mirror it. It can also indicate that it is a mirror of an
upstream, in which case clients may not replay from it until it's promoted
to an owned slot and ceases to be mirrored. Attempts to replay from a
mirrored slot just ERROR and will do so even once decoding on standby is
supported.

+1

This promotion happens automatically if a standby is promoted to a master,
and can also be done manually via sql function call or walsender command to
allow for an internal promotion within a cascading replica chain.

+1.

When a replica connects to an upstream it asks via a new walsender msg "send
me the state of all your failover slots". Any local mirror slots are
updated. If they are not listed by the upstream they are known deleted, and
the mirror slots are deleted on the downstream.

What about slots not listed by the upstream that are currently in use?

The upstream walsender then sends periodic slot state updates while
connected, so replicas can advance their mirror slots, and in turn send
hot_standby_feedback that gets applied to the physical replication slot used
by the standby, freeing resources held for the slots on the master.

+1.

There's one big hole left here. When we create a slot on a cascading leaf or
inner node, it takes time for hot_standby_feedback to propagate the needed
catalog_xmin "up" the chain. Until the master has set the needed
catalog_xmin on the physical slot for the closest branch, the inner node's
slot's catalog_xmin can only be tentative pending confirmation. That's what
a whole bunch of gruesomeness in the decoding on standby patch was about.

One possible solution to this is to also mirror slots "up", as you alluded
to: when you create an "owned" slot on a replica, it tells the master at
connect time / slot creation time "I have this slot X, please copy it up the
tree". The slot gets copied "up" to the master via cascading layers with a
different failover slot type indicating it's an up-mirror. Decoding clients
aren't allowed to replay from an up-mirror slot and it cannot be promoted
like a down-mirror slot can, it's only there for resource retention. A node
knows its owned slot is safe to actually use, and is fully created, when it
sees the walsender report it in the list of failover slots from the master
during a slot state update.

I'm not sure that this actually prevents the problem you describe. It
also seems really complicated. Maybe you can explain further; perhaps
there is a simpler solution (or perhaps this isn't as complicated as I
currently think it is).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#42Craig Ringer
craig@2ndquadrant.com
In reply to: Robert Haas (#41)
Re: WIP: Failover Slots

On 11 August 2017 at 01:02, Robert Haas <robertmhaas@gmail.com> wrote:

Well,
anybody's welcome to write code without discussion and drop it to the
list, but if people don't like it, that's the risk you took by not
discussing it first.

Agreed, patches materializing doesn't mean they should be committed, and
there wasn't prior design discussion on this.

It can be hard to elicit it without a patch, but clearly not always, we're
doing a good job of it here.

When a replica connects to an upstream it asks via a new walsender msg
"send me the state of all your failover slots". Any local mirror slots are
updated. If they are not listed by the upstream they are known deleted,
and the mirror slots are deleted on the downstream.

What about slots not listed by the upstream that are currently in use?

Yes, it'll also need to send a list of its local owned and up-mirrored
failover slots to the upstream so the upstream can create them or update
their state.

There's one big hole left here. When we create a slot on a cascading leaf
or inner node, it takes time for hot_standby_feedback to propagate the
needed catalog_xmin "up" the chain. Until the master has set the needed
catalog_xmin on the physical slot for the closest branch, the inner node's
slot's catalog_xmin can only be tentative pending confirmation. That's
what a whole bunch of gruesomeness in the decoding on standby patch was
about.

One possible solution to this is to also mirror slots "up", as you alluded
to: when you create an "owned" slot on a replica, it tells the master at
connect time / slot creation time "I have this slot X, please copy it up
the tree". The slot gets copied "up" to the master via cascading layers
with a different failover slot type indicating it's an up-mirror. Decoding
clients aren't allowed to replay from an up-mirror slot and it cannot be
promoted like a down-mirror slot can, it's only there for resource
retention. A node knows its owned slot is safe to actually use, and is
fully created, when it sees the walsender report it in the list of failover
slots from the master during a slot state update.

I'm not sure that this actually prevents the problem you describe. It
also seems really complicated. Maybe you can explain further; perhaps
there is a simpler solution (or perhaps this isn't as complicated as I
currently think it is).

It probably sounds more complex than it is. A slot is created tentatively
and marked not ready to actually use yet when created on a standby. It
flows "up" to the master where it's created as permanent/ready. The
permanent/ready state flows back down to the creator.

When we see a temp slot become permanent we copy the
restart_lsn/catalog_xmin/confirmed_flush_lsn from the upstream slot in case
the master had to advance them from our tentative values when it created
the slot. After that, slot state updates only flow "out" from the owner: up
the tree for up-mirror slots, down the tree for down-mirror slots.

Diagram may help. I focused only on the logical slot created on standby
case, since I think we're happy with the rest already and I don't want to
complicate it.

GMail will probably HTMLize this, sorry:

                          Phys rep          Phys rep
                          using phys        using
                          slot "B"          phys slot "C"
                +-------+         +--------+         +-------+
 T              |  A    <^--------+ B      <---------+ C     |
 I              |       |         |        |         |       |
 M              +-------+         +--------+         +-------+
 E                 |                  |                  |
 |                 |                  |                  |CREATEs
 |                 |                  |                  |logical slot X
 v                 |                  |                  |("owned")
                   |                  |                  |as temp slot
                   |                  +<-----------------+
                   |                  |Creates upmirror  |
                   |                  |slot "X" linked   |
                   |                  |to phys slot "C"  |
                   |                  |marked temp       |
                   | <----------------+                  |
                   |Creates upmirror  |                  | <--------------------------+  +-----------------+
                   |slot "X" linked   |                  |   Attempt to decode from "X"  |                 |
                   |to phys slot "B"  |                  |                               | CLIENT          |
                   |marked permanent  |                  |  +------------------------->  |                 |
                   +----------------> |                  |   ERROR: slot X still being   +-----------------+
                   |                  |Sees upmirror     |   created on master, not ready
                   |                  |slot "X" in       |
                   |                  |list from "A",    |
                   |                  |marks it          |
                   |                  |permanent and     |
                   |                  |copies state      |
                   |                  +----------------> |
                   |                  |                  |Sees upmirror slot
                   |                  |                  |"X" on "B" got
marked
                   |                  |                  |permanent
(because it
                   |                  |                  |appears in B's
slot
                   |                  |                  |listings),
                   |                  |                  |marks permanent
on C.
                   |                  |                  |Copies state.
                   |                  |                  |
                   |                  |                  |Slot "X" now
persistent
                   |                  |                  |and (when
decoding on standby
                   |                  |                  |supported) can be
used for decoding
                   |                  |                  |on standby.
                   +                  +                  +

(also avail as
https://gist.github.com/ringerc/d4a8fe97f5fd332d8b883d596d61e257 )

To actually use the slot once decoding on standby is supported: a decoding
client on "C" can consume xacts and cause slot "X" to advance catalog_xmin,
confirmed_flush_lsn, etc. walreceiver on "C" will tell walsender on "B"
about the new slot state, and it'll get synced up-tree, then B will tell A.

Since slot is already marked permanent, state won't get copied back
down-tree, that only happens once when slot is first fully created on
master.

Some node "D" can exist as a phys rep of "C". If C fails and is replaced
with D, the admin can promote the down-mirror slot on "D" to an owned slot.

Make sense?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#43Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#37)
Re: WIP: Failover Slots

On 2017-08-02 16:35:17 -0400, Robert Haas wrote:

I actually think failover slots are quite desirable, especially now
that we've got logical replication in core. In a review of this
thread I don't see anyone saying otherwise. The debate has really
been about the right way of implementing that.

Given that I presumably was one of the people pushing back more
strongly: I agree with that. Besides disagreeing with the proposed
implementation our disagreements solely seem to have been about
prioritization.

I still think we should have a halfway agreed upon *design* for logical
failover, before we introduce a concept that's quite possibly going to
be incompatible with that, however. But that doesn't mean it has to be
submitted/merged to core.

- When a standby connects to a master, it can optionally supply a list
of slot names that it cares about.
- The master responds by periodically notifying the standby of changes
to the slot contents using some new replication sub-protocol message.
- The standby applies those updates to its local copies of the slots.

So, you could create a slot on a standby with an "uplink this" flag of
some kind, and it would then try to keep it up to date using the
method described above. It's not quite clear to me how to handle the
case where the corresponding slot doesn't exist on the master, or
initially does but then it's later dropped, or it initially doesn't
but it's later created.

I think there's a couple design goals we need to agree upon, before
going into the weeds of how exactly we want this to work. Some of the
axes I can think of are:

- How do we want to deal with cascaded setups, do slots have to be
available everywhere, or not?
- What kind of PITR integration do we want? Note that simple WAL based
slots do *NOT* provide proper PITR support, there's not enough
interlock easily available (you'd have to save slots at the end, then
increment minRecoveryLSN to a point later than the slot saving)
- How much divergence are we going to accept between logical decoding on
standbys, and failover slots. I'm probably a lot closer to zero than
Craig is.
- How much divergence are we going to accept between infrastructure for
logical failover, and logical failover via failover slots (or however
we're naming this)? Again, I'm probably a lot closer to zero than
craig is.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#44Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#43)
Re: WIP: Failover Slots

On 12 August 2017 at 08:03, Andres Freund <andres@anarazel.de> wrote:

On 2017-08-02 16:35:17 -0400, Robert Haas wrote:

I actually think failover slots are quite desirable, especially now
that we've got logical replication in core. In a review of this
thread I don't see anyone saying otherwise. The debate has really
been about the right way of implementing that.

Given that I presumably was one of the people pushing back more
strongly: I agree with that. Besides disagreeing with the proposed
implementation our disagreements solely seem to have been about
prioritization.

I still think we should have a halfway agreed upon *design* for logical
failover, before we introduce a concept that's quite possibly going to
be incompatible with that, however. But that doesn't mean it has to be
submitted/merged to core.

How could it be incompatible? The idea here is to make physical failover
transparent to logical decoding clients. That's not meant to sound
confrontational, I mean that I can't personally see any way it would be and
could use your ideas.

I understand that it might be *different* and you'd like to see more
closely aligned approaches that work more similarly. For which we first
need to know more clearly how logical failover will look. But it's hard not
to also see this as delaying and blocking until your preferred approach via
pure logical rep and logical failover gets in, and physical failover can be
dismissed with "we don't need that anymore". I'm sure that's not your
intent, I just struggle not to see it that way anyway when there's always
another reason not to proceed to solve this problem because of a loosely
related development effort on another problem.

I think there's a couple design goals we need to agree upon, before
going into the weeds of how exactly we want this to work. Some of the
axes I can think of are:

- How do we want to deal with cascaded setups, do slots have to be
available everywhere, or not?

Personally, I don't care either way.

- What kind of PITR integration do we want? Note that simple WAL based
slots do *NOT* provide proper PITR support, there's not enough
interlock easily available (you'd have to save slots at the end, then
increment minRecoveryLSN to a point later than the slot saving)

Interesting. I haven't fully understood this, but think I see what you're
getting at.

As outlined in the prior mail, I'd like to have working PITR with logical
slots but think it's pretty niche as it can't work usefully without plenty
of co-operation from the rest of the logical replication software in use.
You can't just restore and resume normal operations. So I don't think it's
worth making it a priority.

It's possible to make PITR safe with slots by blocking further advance of
catalog_xmin on the running master for the life of the PITR base backup
using a slot for retention. There's plenty of room for operator error
until/unless we add something like catalog_xmin advance xlog'ing, but it
can be done now with external tools if you're careful. Details in the prior
mail.
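
The pinning mechanism can be sketched as a toy model: the master's effective catalog_xmin horizon is the minimum over all slots' catalog_xmin, so an extra "pin" slot held for the life of the base backup blocks catalog tuple removal past that point. The slot names and numbers here are invented for illustration, not real tooling:

```python
# Toy model of the PITR workaround: a "pin" slot holds the master's
# global catalog_xmin horizon in place for the life of the base backup.
def effective_catalog_xmin(slots):
    # The global horizon is the minimum catalog_xmin across all slots.
    return min(slots.values())

slots = {"app_slot": 900}
# Before starting the base backup, create a pin slot at the current horizon.
slots["pitr_pin"] = effective_catalog_xmin(slots)
# The app slot keeps advancing, but the global horizon stays pinned...
slots["app_slot"] = 950
assert effective_catalog_xmin(slots) == 900
# ...until the backup completes and the pin slot is dropped.
del slots["pitr_pin"]
assert effective_catalog_xmin(slots) == 950
```

As the text notes, without xlog'ing of catalog_xmin advances there's plenty of room for operator error in doing this with external tools.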

I don't think PITR for logical slots is important given there's a
workaround and it's not simple to actually do anything with it if you have
it.

- How much divergence are we going to accept between logical decoding on
standbys, and failover slots. I'm probably a lot closer to zero than
Craig is.

They're different things to me, but I think you're asking "to what extent
should failover slots functionality be implemented strictly on top of
decoding on standby?"

"Failover slots" provides a mechanism by which a logical decoding client
can expect a slot it creates on a master (or physical streaming replica
doing decoding on standby) to continue to exist. The client can ignore
physical HA and promotions of the master, which can continue to be managed
using normal postgres tools. It's the same as, say, an XA transaction
manager expecting that if your master dies and you fail over to a standby,
the TM shouldn't have to have been doing special housekeeping on the
promotion candidate before promotion in order for 2PC to continue to work.
It Just Works.

Logical decoding on standby is useful with or without failover slots, as
you can use it to extract data from a replica, and now that decoding
timeline following is in, a decoding connection on a replica will survive
promotion to master.

But in addition to its main purpose of allowing logical decoding from a
standby server to offload work, it can be used to implement client-managed
support for failover to physical replicas. For this, the client must have
an inventory of promotion-candidates of the master and their connstrings so
it can maintain slots on them too. The client must be able to connect to
all promotion-candidates and advance their slots via decoding along with
the master slots it's actually replaying from. If a client isn't "told"
about a promotion candidate, decoding will break when we fail over. If a
client cannot connect to a promotion candidate, catalog_xmin will fall
behind on master until the replica is discarded (and its physical slot
dropped) or the client regains access. Every different logical decoding
client application must implement all this logic and management separately.

It may be possible to implement failover-slots like functionality based on
decoding on standby in an app transparent way, by having the replica
monitor slot states on the master and self-advance its own slots by
loopback decoding connection. Or the master could maintain an inventory of
replicas and make decoding connections to them where it advances their
slots after the masters' slots are advanced by an app. But either way, why
would we want to do this? Why actually decode WAL and use the logical
decoding machinery when we *know* the state of the system because only the
master is writeable?

The way I see it, to provide failover slots functionality we'd land up with
something quite similar to what Robert and I just discussed, but the slot
advance would be implemented using decoding (on standby) instead of
directly setting slot state. What benefit does that offer?

I don't want to block failover slots on decoding on standby just because
decoding on standby would be nice to have.

- How much divergence are we going to accept between infrastructure for
logical failover, and logical failover via failover slots (or however
we're naming this)? Again, I'm probably a lot closer to zero than
craig is.

We don't have logical failover, let alone mature, tested logical failover
that covers most of Pg's available functionality. Nor much of a design for
it AFAIK. There is no logical failover to diverge from, and I don't want to
block physical failover support on that.

But, putting that aside to look at the details of how logical failover
might work, what sort of commonality do you expect to see? Physical
failover is by WAL replication using archive recovery/streaming, managed
via recovery.conf, with unilateral promotion by trigger file/command. The
admin is expected to ensure that any clients and cascading replicas get
redirected to the promoted node and the old one is fenced - and we don't
care if that's done by IP redirection or connstring updates or what. Per
the proposal Robert and I discussed, logical slots will be managed by
having the walsender/walreceiver exchange slot state information that
cascades up/down the replication tree via mirror slot creations.

How's logical replica promotion going to work? Here's one possible way, of
many: the promotion-candidate logical replica consumes an unfiltered xact
stream that contains changes from all nodes, not just its immediate
upstream. Downstreams of the master can maintain direct connections to the
promotion candidate and manage their own slots directly, sending flush
confirmations for slots on the promotion candidate as they see their
decoding sessions on the replica decode commits for LSNs the clients sent
flush confirmations to the master for. On promotion, the master's
downstreams would be reconfigured to connect to the node-id of the newly
promoted master and would begin decoding from it in catchup mode, where
they receive the commits from the old master via the new master, until they
reach the new master's end-of-wal at time of promotion. With some tweaks
like a logical WAL message recording the moment of promotion, it's not that
different to the client-managed physical failover model.

It can also be converted to a more transparent failover-slots like model by
having the promotion candidate physical replica clone slots from its
upstream, but advance them by loopback decoding - not necessarily actual
network loopback. It'd use a filter that discards data and only sees the
commit XIDs + LSNs. It'd send confirmations on the slots when the local
slot processed a commit for which the upstream's copy of the slot had a
confirmation for that lsn. On promotion, replicas would connect with new
replorigins (0) and let decoding start at the slot positions on the
replica. The master->replica slot state reporting can be done via the
walsender too, just as proposed for the physical case, though no
replica->master reporting would be needed for logical failover.

So despite my initial expectations they can be moderately similar in broad
structure. But I don't think there's going to be much actual code overlap
beyond minor things like both wanting a way to query slot state on the
upstream. Both *could* use decoding on standby to advance slot positions,
but for the physical case that's just a slower (and unfinished) way to do
what we already have, wheras it's necessary for logical failover.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#45Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#44)
Re: WIP: Failover Slots

On 14 August 2017 at 11:56, Craig Ringer <craig@2ndquadrant.com> wrote:

I don't want to block failover slots on decoding on standby just because
decoding on standby would be nice to have.

However, during discussion with Thomas Munro a point has come up that does
block failover slots as currently envisioned - silent timeline divergence.
It's a solid reason why the current design and implementation is
insufficient to solve the problem. This issue exists both with the original
failover slots and with the model Robert and I were discussing.

Say a decoding client has replayed from master up to commit of xid 42 at
1/1000 and confirmed flush, then a failover slots standby of the master is
promoted. The standby has only received WAL from the failed master up to
1/500 with most recent xid 20. Now the standby does some other new xacts,
pushing xid up to 30 at 1/1000 then continuing to insert until xid 50 at
lsn 1/2000.

Then the logical client reconnects. The logical client will connect to the
failover slot fine, and start replay. But it'll ask for replay to start at
1/1000. The standby will happily fast-forward the slot (as it should), and
start replay after 1/1000.

But now we have silent divergence in timelines. The logical replica has
received and committed xacts 20...42 at lsn 1/500 through 1/1000, but these
are not present on the promoted master. And the replica has skipped over
the new-master's xids 20...30 with lsns 1/500 through 1/1000, so they're
present on the new master but not the replica.

IMO, this shows that not including the timeline in replication origins was
a bit of a mistake, since we'd trivially detect this if they were included
- but it's a bit late now. And anyway, detection would just mean logical
rep would break, which doesn't help much.
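
The scenario above can be put into a toy model. LSNs are plain ints (0x1000 stands in for "1/1000") and the (xid, lsn) pairs are invented to match the numbers in the text:

```python
# Toy model of silent timeline divergence after promotion of a
# failover-slots standby. All (xid, lsn) pairs are illustrative.
old_master_tail = [(21, 0x600), (30, 0x900), (42, 0x1000)]   # replayed by the client
new_master_tail = [(21, 0x700), (30, 0x1000), (50, 0x2000)]  # written after promotion
confirmed_flush = 0x1000  # client's last confirmed position on the old master

# On reconnect, the promoted standby fast-forwards the slot and replays
# only commits past the client's requested start point.
replayed = [c for c in new_master_tail if c[1] > confirmed_flush]
skipped  = [c for c in new_master_tail if c[1] <= confirmed_flush]
extra    = [c for c in old_master_tail if c[1] > 0x500]  # past standby's receive point

assert replayed == [(50, 0x2000)]
assert skipped  == [(21, 0x700), (30, 0x1000)]  # on the new master, not the replica
assert extra    == old_master_tail              # on the replica, not the new master

# Had replication origins recorded a timeline alongside the LSN, the
# mismatch would be trivially detectable: the client asks for
# (timeline 1, 0x1000), but the promoted standby forked to timeline 2
# back at 0x500.
fork_lsn = 0x500
client_origin = (1, 0x1000)
assert client_origin[0] != 2 and client_origin[1] > fork_lsn  # divergence detected
```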

The simplest fix, but rather limited, is to require that failover
candidates be in synchronous_standby_names, and delay ReorderBufferCommit
sending the actual commit message until all peers in s_s_n confirm flush of
the commit lsn. But that's not much good if you want sync rep for your
logical connections too, and is generally a hack.

A more general solution requires that masters be told which peers are
failover candidates, so they can ensure ordering between logical decoding
and physical failover candidates. Which effectively adds another kind of
sync rep, where we do "wait for physical failover candidates to flush, and
only then allow logical decoding". This actually seems pretty practical
with the design Robert and I discussed, but it's definitely an expansion in
scope.
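
A hedged sketch of that ordering rule: hold back a decoded commit until every registered physical failover candidate has flushed past its LSN, so a promoted candidate can never lack a commit the logical client has already replayed. The function names here are invented, not proposed APIs:

```python
# Illustrative sketch of gating logical decoding on physical failover
# candidates' flush positions. Names and values are invented.
def send_horizon(candidate_flush_lsns):
    # Commits at or below this LSN are safe to send to logical clients;
    # with no registered candidates there is nothing to wait for.
    return min(candidate_flush_lsns) if candidate_flush_lsns else float("inf")

def safe_to_send(pending_commit_lsns, candidate_flush_lsns):
    horizon = send_horizon(candidate_flush_lsns)
    return [lsn for lsn in pending_commit_lsns if lsn <= horizon]

pending = [0x600, 0x700]
# One candidate is still at 1/650, so the commit at 1/700 must wait.
assert safe_to_send(pending, [0x800, 0x650]) == [0x600]
# Once it flushes past 1/700, the held-back commit can be sent too.
assert safe_to_send(pending, [0x800, 0x750]) == [0x600, 0x700]
```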

Alternately, we could require the decoding clients to keep an eye on the
flush/replay positions of all failover candidates and delay commit+confirm
of decoded xacts until the upstream's failover candidates have received and
flushed up to that lsn. That starts to look a lot like a decoding on
standby based model for logical failover, where the downstream maintains
slots on each failover candidate upstream.

So yeah. More work needed here. Even if we suddenly decided the original
failover slots model was OK, it's not sufficient to fully solve the problem.

(It's something I'd thought about for BDR failover, but never applied to failover
slots: the problem of detecting or preventing divergence when the logical
client is ahead of physical receive at the time the physical standby is
promoted.)

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services